SUMMARY


The job performance measurement literature indicates that previous research relied heavily on
broad-based generic indices, performance ratings, or operational measures with their inherent
problems  of  inflation  and  halo  effects.  These  broad  measures  were  unable  to  take  into  account 
task-level-specific  influences  such  as  training  differences  or  opportunities  to  perform;  hence, 
such  efforts  have  been  largely  unsuccessful.  However,  it  appears  that  current  interest, 
resources,  and  state-of-the-art  technology  developments  have  now  significantly  increased  the 
probability  of  developing  successful  measures  of  job  performance.  This  report  describes  the  Air 
Force  Human  Resources  Laboratory's  (AFHRL)  research  program  for  development  of  individual  job 
performance  measures.  The  report  describes  the  construction  of  a  job  performance  measurement 
classification scheme into which the relevant empirical and theoretical literature is
organized. Based on this framework, specific recommendations for both applications and research
directions  are  given. 


PREFACE 


This report describes the initiation of a long-term program of research and
development (R&D) focusing on job performance criterion development. The work was
performed by McFann-Gray and Associates, Inc., under contract F41669-81-C-0022 with the
Air Force Human Resources Laboratory (AFHRL), Manpower and Personnel Division. The work
was accomplished under Work Unit 77191821. Dr. R. Bruce Gould was the AFHRL Contract
Monitor.

Several influences have highlighted the Air Force's need for performance
measurement and brought ongoing and planned programs to their current state. Planning
for the research program began several years ago on the recommendation of two Research
Advisory Panels (composed of knowledgeable scientists from academia and industry, as
well as peers from the Army and Navy). They reviewed the entire AFHRL manpower,
personnel, and training research program and recommended consolidation of separate
measurement efforts into one unified research program. At the same time, the Uniform
Guidelines for Employee Selection (1978) and a review of case law mandated that Air
Force civilian selection systems be validated against job performance measures.
Finally, Congress mandated that military selection tests be validated against hands-on
job performance measures. These operational, legal, and Congressional mandates have
thus provided the impetus to planning and obtaining support for a lengthy, high-resource
research effort.

The short-term objective of this effort is the development of on-the-job
performance measures to validate Air Force selection and classification procedures.
Guidelines for developing and obtaining the performance measures will be established for
a wide range of enlisted, officer, and civilian jobs. Once obtained, the measures will
be placed in a data base for validation use. The long-term goal is to establish an
operational performance measurement program for evaluation of selection and training
procedures, as well as personnel policies and practices. The goal here is to
operationalize the procedures so that performance measurement and evaluation can be
carried on by technicians, as is currently done by the USAF Occupational Measurement
Center with the Occupational Survey (Job Analysis) Program. In this way, R&D resources
will be freed for other projects.


JOB PERFORMANCE MEASUREMENT IN THE MILITARY:
A  CLASSIFICATION  SCHEME,  LITERATURE  REVIEW, 
AND  DIRECTIONS  FOR  RESEARCH 


I.  INTRODUCTION 

The  major  purpose  of  this  report  is  to  describe  a  job  performance  measurement  classification 
scheme,  with  emphasis  on  its  applicability  in  a  military  context.  The  field  of  job  performance 
measurement  has  probably  generated  more  literature  in  the  behavioral  sciences  than  has  any  other 
topic, yet no complete conceptual framework for this phenomenon exists. The
works  of  DeCotiis  and  Petit  (1978)  and  Wherry  and  Bartlett  (1982)  represent  the  most  significant 
efforts  at  providing  partial  conceptual  frameworks,  and  these  will  be  reviewed  in  more  detail 
later  in  this  report.  The  importance  of  these  two  conceptualizations  to  this  effort  is  that  they 
share  the  same  perspective  that  accuracy  of  the  performance  evaluation  is  the  most  critical 
indicator  of  the  quality  of  the  measurement. 

The  lack  of  a  complete  conceptual  framework  for  the  measurement  of  job  performance  is  central 
to  the  problem  of  properly  specifying  and  measuring  dependent  variables  in  both  causal  and 
covariate  designs.  This  is  particularly  problematic  for  the  applied  researcher  who  is  concerned 
with  understanding  and  predicting  the  behavior  of  people  in  organizations.  Within  the  military 
context,  this  problem  becomes  more  acute  for  scientists  involved  in  research  and  recommendations 
for  action  in  any  of  the  traditional  personnel  decision  functions.  Thus,  a  major  outcome  of  this 
report  will  be  a  conceptually  based  descriptive  classification  scheme  of  performance  measurement 
variables  that  may  be  used  (a)  to  summarize  and  organize  research  progress  in  terms  of  previous 
empirical work and (b) to identify future research and development (R&D) needs. These two
outcomes should prove helpful to the long-term R&D program being initiated by the Air Force Human
Resources  Laboratory  (AFHRL)  to  develop  a  methodology  for  measuring  job  performance  in  the 
military. 


1.1  Organization  of  Report 

The  four  chapters  that  follow  will  provide:  (a)  a  description  of  a  conceptual  performance 
measurement  classification  scheme;  (b)  an  examination  of  the  empirical  and  theoretical 
literature  relevant  to  the  variables  and  relationships  identified  in  the  schema;  (c)  specific 
recommendations  for  both  applications  and  research  directions;  and  (d)  priorities  related  to 
research  directions. 

The  second  chapter  is  an  integration  and,  by  necessity,  a  deductive  extension  of  previous 
attempts  to  provide  conceptual  descriptions  of  parts  of  the  performance  measurement  situation 
(see, for example, Cummings & Schwab, 1973; DeCotiis & Petit, 1978; Kavanagh, 1982a; Landy & Farr,
1980; MacKinney, 1967; Ronan & Prien, 1971; Wherry & Bartlett, 1982). The integration of
previous  conceptualizations  is  necessary  because  none  completely  describes  all  aspects  of  the 
performance  measurement  situation  as  envisioned  in  this  report.  The  second  chapter  provides  a 
conceptually  based  descriptive  classification  scheme  that  serves  as  a  mechanism  for  organizing 
the  literature  review  and  prescribing  needed  research. 

The  third  chapter  examines  the  literature  on  performance  measurement.  Computer  searches  of 
both the public (e.g., Psych SCAN) and Department of Defense (DOD) literature were conducted to
identify  as  much  of  the  relevant  literature  as  possible.  In  addition,  behavioral  scientists 
identified  with  the  performance  measurement  literature  were  contacted  in  an  attempt  to  uncover 
current,  unpublished  studies  related  to  this  topic.  This  chapter  represents  a  first  attempt  to 
verify the relationships or linkages hypothesized in the classification scheme and provides a
summary showing the empirical support (or non-support) for these relationships. This literature
review  serves  as  a  basis  for  revision  of  the  schema  and  provides  direction  for  AFHRL's  program  of 
research. 


The fourth and fifth chapters are most important, in that they can be used as guides for the
long-term R&D program within AFHRL to develop a measurement methodology for job performance.
The fourth chapter, organized by the linkages in the model, contains recommendations both for
specific features to include in the design of the measurement methodology and for specific areas
where research is needed. The recommendations will help to conserve AFHRL resources by also
specifying where research is not necessary (i.e., where prescriptive advice exists in the
literature).

The  final  chapter  provides  recommendations  for  research  that  are  prioritized  in  terms  of 
their  importance  to  the  overall  program  of  R&D  at  AFHRL,  presented  in  chronological  order  to 
serve  as  a  planning  tool,  and  integrated  within  the  conceptually  based  classification  scheme  in 
this  report. 


1.2  Terminology 

Before  proceeding  further,  it  is  important  to  define  and  differentiate  among  the  various 
terms  used  in  the  field  of  performance  measurement.  Criterion,  one  of  the  most  commonly  used 
terms  in  the  field,  refers  to  a  measure  of  performance.  In  the  context  of  this  report,  a 
criterion  is  a  measure  of  an  individual's  performance  on  a  job.  Performance  measure  is 
essentially  the  same  as  criterion  for  the  purposes  of  this  report,  and  these  two  terms  will  be 
used  interchangeably;  however,  it  is  possible  to  have  more  than  one  performance  measure.  The 
different  performance  measures  are  sometimes  referred  to  as  dimensions  of  performance;  these 
terms  will  be  used  interchangeably  in  this  report. 

Performance  measures  can  vary  in  several  ways.  First,  they  can  be  of  differing  complexity 
(e.g.,  a  simple  count  of  the  number  of  defects  on  an  inspection,  or  a  supervisory  rating  of  the 
leadership  quality  of  a  subordinate).  Performance  measures  can  also  vary  in  terms  of  objectivity 
versus  subjectivity.  In  the  previous  example,  count  of  defects  is  fairly  objective  and  easily 
quantified,  whereas  a  rating  of  leadership  quality  involves  more  subjective  processes  and  is  less 
easily  quantified.  It  is  important  not  to  confuse  objectivity-subjectivity  with  the  amount  of 
judgment used to define the performance measure. Even a count of defects requires considerable
evaluative judgment before a clerk can make a tally. In this report, objective and subjective
measures  will  be  used  within  this  definition,  and  no  degree  of  judgment  will  be  implied  by  either 
term.  Subjective  job  performance  measures  are  typically  called  performance  ratings,  and  this 
convention  will  be  followed.  Objective  performance  measures  are  sometimes  called  production  or 
productivity  measures  or  records;  however,  that  usage  is  somewhat  erroneous.  Production  or 
productivity is much too general a term (as will be discussed later) to equate with the
narrowness  of  objective  performance  measures.  Further,  it  is  obvious  that  subjective  performance 
measures  also  indicate  something  important  about  an  individual's  productivity. 

Finally,  performance  measures  can  vary  in  terms  of  the  degree  of  control  the  individual  has 
over  altering  personal  performance  on  the  measures.  If  there  exist  constraints  on  performance 
due  to  inadequate  technology  or  supplies,  for  example,  the  individual  can  do  little  to  affect 
performance  on  the  measure.  However,  one  would  be  hard-pressed  to  say  an  individual  has  similar 
constraints  on  a  performance  dimension  such  as  personal  appearance.  Although  there  is  no  special 
terminology  to  differentiate  performance  measures  on  this  continuum,  this  distinction  will  have 
considerable  importance  for  establishing  the  foundation  of  the  conceptual  model. 


The  terms  performance  measurement,  performance  evaluation,  and  performance  appraisal  are  used 
interchangeably  in  this  report.  They  all  refer  to  the  process  by  which  performance  measures  are 
generated. Kavanagh (1982b) has defined them as "the process, for a defined purpose, that
involves the systematic measurement of individual differences in employees' performance on their
jobs" (p. 192). This definition is consistent with other definitions found in the literature.

It is important to distinguish between a performance measure, performance measurement, and a
performance measurement system. The performance measure is the outcome of the process defined as
performance measurement. The performance measurement system involves all components of the
performance measurement function within an organization. Thus, supervisory training to use the
measures and the measurement process, administrative procedures for administering and maintaining
the measurement, and the relationships between the performance appraisal system and other
personnel systems are all included in this concept. Performance measurement system, performance
evaluation system, and performance appraisal system are typically used interchangeably with
performance evaluation program. In order to avoid confusion, only the term performance appraisal
system will be used in this report to refer to the total function involved in measuring
individual job performance.

The  final  point  to  be  made  here  involves  the  use  of  the  word  productivity.  Often, 
productivity  measures  are  used  interchangeably  with  objective  performance  measures.  This  is  much 
too narrow, and productivity will not be used in this manner. Productivity is a more general
term, and close to what Kavanagh (1982b) has defined as job performance. "Job performance is a
dynamic, multidimensional construct, assumed to indicate an employee's behavior in executing the
requirements  of  a  given  organizational  role"  (p.  195).  The  term  job  performance  will  be  used  in 
this  report,  following  the  meaning  described  above,  to  avoid  confusion  with  other  uses  of  the 
term  productivity  in  the  literature. 


1.3 Background

Previous applied research efforts in the military have traditionally not used job performance
as the dependent variable for validating personnel procedures and decisions. Typically, training
school grades have been employed to validate the Armed Services Vocational Aptitude Battery
(ASVAB) and its predecessor selection and classification tests. Although training school success
is an important dependent variable in the military human resources management system, it is an
intermediate criterion of the effectiveness of personnel selection. The crucial question that
remains is whether the scores on the ASVAB can successfully predict individual effectiveness in
job performance once the person is on the job. It is important to note that this logic applies
not only to the validation of the ASVAB but to applied research efforts involving decisions
within the human resources management system. For example, the empirical question of whether
females can perform as well as males in traditionally non-female jobs in the military cannot be
answered using training school success only. Nor can the effectiveness of a placement/transfer
system be evaluated only in terms of personal adjustment and time to proficiency in the new job.
Clearly, a measure of individual effectiveness on the job is needed to validate such personnel
decisions. However, the focal example used throughout the remainder of this report will be the
validation of the ASVAB.

If there exists a need for criterion measurement, why not use the performance appraisals
already available in the military? The well-documented problems of leniency errors (or effects)
and reduced variance for these ratings have limited their usefulness for validation efforts.
More critically, the performance appraisals currently in use in the military are primarily
designed for administrative actions (i.e., promotions). As such, they are subject to "gaming,"
which can distort the true score an individual should receive in terms of job performance. In
order to evaluate the validity of the ASVAB, it is necessary to develop a performance measurement
methodology for research purposes only, so as to better estimate actual performance levels of
individual airmen. The literature has shown that data gathered for research purposes, as opposed
to data collected for administrative uses, contains less distortion and, more importantly, shows
greater variance.

Thus,  the  need  exists  for  a  criterion  measurement  methodology  to  index  individual 
effectiveness  on  military  jobs  for  use  in  validation  research.  Over  the  past  2  years,  technical 
reviews  of  the  R&D  programs  of  the  AFHRL  have  consistently  noted  that  this  effort  should  not 
"start  from  scratch."  The  large  volume  of  previous  research  on  the  "criterion  problem,"  as  well 
as  the  increased  volume  of  available  data  as  a  result  of  the  passage  of  the  Civil  Service  Reform 
Act  (CSRA)  and  recent  Equal  Employment  Opportunity  (EEO)  court  decisions,  will  serve  as 
guidelines  to  enable  the  criterion  development  research  to  be  accomplished  in  a  relatively 
efficient  and  cost-effective  manner.  The  purpose  of  this  report,  as  noted  earlier,  is  to  provide 
a  conceptual  framework  to  guide  this  effort. 


1.4  Classification  Scheme  Boundaries 


To  provide  a  clear  focus  for  the  classification  scheme,  it  is  necessary  to  describe  the 
boundary  conditions  that  will  be  used. 

These  conditions  are: 

1. The classification schema focuses on performance measurement in the military.

2.  The  schema  describes  the  case  in  which  performance  measurement  is  being  used  for  research 
purposes  only. 

3.  The  schema  considers  all  variables  that  affect  performance  measurement:  organizational, 
situational,  group,  dyadic,  and  individual. 

In  the  military  context,  there  may  be  performance  variables  that  do  not  appear  in  non-military 
settings  (e.g.,  those  concerned  with  weapons  maintenance  and  use,  and  with  combat 
effectiveness).  Also,  some  variables  in  the  model  may  have  greater  salience  than  in  the 
non-military context. In the military environment, too, the variance in performance ratings is
likely to be greater than that found in non-military contexts. In non-military contexts, there
is usually more pre-selection, and as a result, the variance on the selected aptitudes for jobs
is smaller than that typically found in the military.
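
The effect of pre-selection on variance described above can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions: the data are synthetic, and the correlation (`rho`), the fraction retained (`keep_fraction`), and the function names are hypothetical choices, not values or procedures from this report. It draws aptitude and performance scores that are correlated in the full group, keeps only the highest-aptitude cases, and compares variances and correlations before and after selection.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation computed from first principles."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def variance(xs):
    """Population variance of a list of scores."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)


def simulate(n=20000, rho=0.5, keep_fraction=0.3, seed=0):
    """Simulate aptitude/performance pairs, then pre-select on aptitude.

    All values are synthetic. rho and keep_fraction are illustrative
    assumptions, not estimates from any military or civilian sample.
    """
    rng = random.Random(seed)
    aptitude, performance = [], []
    for _ in range(n):
        a = rng.gauss(0.0, 1.0)
        # Performance shares variance with aptitude through rho.
        p = rho * a + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
        aptitude.append(a)
        performance.append(p)

    # Pre-selection: keep only the top keep_fraction of aptitude scores.
    cutoff = sorted(aptitude)[int(n * (1.0 - keep_fraction))]
    kept = [(a, p) for a, p in zip(aptitude, performance) if a >= cutoff]
    kept_a = [a for a, _ in kept]
    kept_p = [p for _, p in kept]

    return {
        "var_aptitude_all": variance(aptitude),
        "var_aptitude_selected": variance(kept_a),
        "var_performance_all": variance(performance),
        "var_performance_selected": variance(kept_p),
        "r_all": pearson(aptitude, performance),
        "r_selected": pearson(kept_a, kept_p),
    }


if __name__ == "__main__":
    for name, value in simulate().items():
        print(f"{name}: {value:.3f}")
```

In this sketch the selected group shows smaller variance on both variables and a weaker aptitude-performance correlation than the full group, which is the pattern the paragraph attributes to settings with more pre-selection.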

The fact that the performance measures will be used for validation research only will likely
change the impact of the different variables (e.g., Zedeck & Cascio, 1982). In our schema, this
means that the variables, their interrelationships, and their salience would change depending on
the purpose of the measurement. In terms of a regression analogy, it would be expected that the
beta weights for the independent variables would change as a function of whether the performance
measures were to be used for employee growth and development, administrative, or research
purposes. For example, the relation between pay and performance would be highly salient in a
schema that was concerned with the use of measures for administrative purposes, but probably less
salient within either a growth and development or validation framework.
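
The regression analogy in the preceding paragraph can be written out explicitly. The notation is an assumption introduced here for illustration and does not appear in the original report: $Y$ stands for the outcome of interest in the schema, $X_1, \dots, X_k$ for the independent variables, and the superscript $(p)$ for the purpose of the measurement.

```latex
\[
  \hat{Y}^{(p)} \;=\; \beta_0^{(p)} + \sum_{j=1}^{k} \beta_j^{(p)} X_j ,
  \qquad
  p \in \{\text{development},\ \text{administrative},\ \text{research}\}
\]
```

Under this reading, the same set of variables $X_j$ enters every equation, but a given weight $\beta_j^{(p)}$ need not be equal across purposes. In the pay example, the weight attached to the pay-performance relation would be expected to be larger in the administrative equation than in the other two.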

The third boundary condition is not a constraint; rather, it is an extension of previous
performance measurement schemas. Other classification schemes are typically more micro in their
perspective. For example, some schemas, explicitly or implicitly, concern only the cognitive
processes of the rater and their effect on the quality of the measures. Other approaches are
concerned with the dyadic relationship between the rater and the ratee. The present
classification scheme will be more macro and include all of the relevant variables that affect
the measurement of individual job performance.


II. A CONCEPTUALLY BASED DESCRIPTIVE CLASSIFICATION SCHEME OF
PERFORMANCE  MEASUREMENT  QUALITY 


Based  on  the  various  considerations  outlined  in  Section  I,  and  an  examination  of  the 
literature  from  the  behavioral  sciences,  an  approach  to  development  of  a  classification  scheme 
was  selected  that  involves  the  following  considerations: 


1.  Variables  were  included,  on  the  basis  of  the  theoretical  and  empirical  literature,  that 
could  affect  either  job  performance  or  the  measurement  of  job  performance. 

2.  Classical  test  score  theory,  with  its  emphasis  on  true  and  error  variance  in  observed 
scores,  provided  a  general  perspective. 

3.  Rather  than  including  detailed  individual  variables,  these  variables  were  classified  into 
categories  for  ease  of  presentation. 

4.  An  iterative  process  was  used,  beginning  with  a  general  schema  of  job  performance  and 
ending  with  a  job  performance  measurement  classification  scheme  for  validation  purposes. 

5.  The  applicability  of  the  classification  scheme  for  use  in  a  military  setting  was  an 
overriding  concern. 

These  considerations  will  be  discussed  as  the  schema  is  described. 
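
The classical test score perspective named in consideration 2 can be stated compactly. The notation below is the standard textbook formulation, given here as an aid to the reader rather than as notation taken from this report: $X$ is the observed score, $T$ the true score, and $E$ the error component, with error assumed to be uncorrelated with true score.

```latex
\begin{align}
  X &= T + E, \qquad \operatorname{Cov}(T, E) = 0 \\
  \sigma_X^2 &= \sigma_T^2 + \sigma_E^2 \\
  \rho_{XX'} &= \frac{\sigma_T^2}{\sigma_X^2}
             = \frac{\sigma_T^2}{\sigma_T^2 + \sigma_E^2}
\end{align}
```

Read against the first consideration, a variable included in the scheme can add to $\sigma_T^2$ (when it reflects real differences in job performance) or to $\sigma_E^2$ (when it affects only the measurement of that performance).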


Before describing the development process, we wish to emphasize that the frameworks for the
schema, which were derived from the theoretical and empirical literature, are descriptive rather
than prescriptive, because, in our judgment, the causal linkages hypothesized in the scheme are
incomplete. This does not mean that no research evidence exists, but rather, that further
research is necessary before the scheme can be classified as prescriptive. As will be seen in
later sections of this report, there are a variety of research findings that impact directly and
indirectly on the hypothesized linkages of the schema, and some of this literature can provide
prescriptive advice for the development of a measurement methodology for job performance.

Perhaps the major reason for a conservative stance lies in the definition of measurement
quality used in this report. As will be discussed in more detail, we consider accuracy and
construct validity as the primary criteria for evaluating the quality of measurement when the
purpose of the measurement is for validation research. Although other criteria of measurement
quality (e.g., halo and leniency) have been used extensively to judge the "goodness" of job
performance measures, we believe they have less relevance for "research only" performance
measurement.


2.1  A  Simplified  Job  Performance  Schema 

Prior to the development of a framework describing the measurement of job performance, it was
necessary to develop a schema of job performance as a first step. It was important to identify
all variables that could potentially impact on a person's job performance, since these same
variables could be important sources of true or error variance in the measurement of job
performance. An examination of theories in the area of work motivation (cf. Steers & Porter,
1979) showed them to be conceptually comprehensive, but lacking in detail in terms of the
specific variables that affect job performance. However, using these general models and others
from the organization behavior literature (e.g., Naylor, Pritchard, & Ilgen, 1980), a general
schema of the variables that impact on individual job performance was constructed (Figure 1).


Figure  1.  A  Simplified  Job  Performance  Scheme. 

A brief description of the schema depicted in Figure 1 will suffice to provide the interested
reader the opportunity to look more deeply into the theoretical underpinnings of the scheme.

Starting on the left side of Figure 1, the presence of individual variables in the model is
axiomatic, and based on the common theme in the motivational literature (Steers & Porter, 1979)
that individual job performance is a function of the skills, aptitudes, and effort a person
brings to a job. These variables, according to Figure 1, indirectly influence job performance
through their impact on the relationship with the supervisor (Bass, 1981; Vroom, 1976; Yukl,
1981) and their interaction with work group factors (Graen, 1976; Hackman, 1976). It is
important to note that non-job variables could also affect the interaction of the individual
variables with both the work group and the relationship with the supervisor. These factors would
include such variables as marital status, religious preference, and membership in a dual-career
family (Hamel, 1981; Owens & Champagne, 1965). Note that this class of variables is not on the
major causal linkage in the schema, thus indicating that these variables may or may not impact at
this point in the model. The same is true for the organizational factors (Adams, 1965; Lawler,
1976; Payne & Pugh, 1976), situational constraints (Chapanis, 1976; Peters, O'Connor, & Rudolf,
1980), and stressful life events (Kavanagh, 1982a). Although any of these factors can become
quite salient in terms of affecting individual job performance, they are not always operative.

The "simplified" framework in Figure 1 is not only conceptually and empirically based, but it
logically links major sources of variance in job performance together in a meaningful manner.
Current research in the field continues to explore the importance of each of these variables;
however, for our purposes, the differential impact of these variables on individual job
performance is unimportant. As long as the possibility exists that each of the classes of
variables in Figure 1 can influence individual job performance, then it is also possible that
they can affect the quality of the measurement of job performance.


2.2 Development of a General Job Performance Measurement Classification Scheme


The  model  presented  in  Figure  2  provides  a  general  classification  scheme  of  performance 
measurement quality. The model in Figure 2 is similar to the Figure 1 model in that it suggests
no  direct,  isomorphic  relationship  between  a  person's  skills,  aptitudes,  and  effort  and  the 
outcome  variable.  However,  these  input  variables  are  included  since  they  can  contribute  either 
true  or  error  variance  to  performance  measurement  quality. 

Figure 2. A Job Performance Measurement Classification Scheme.

It  should  be  noted  that  Figure  2  is  simply  a  general  performance  measurement  quality 
classification  scheme.  A  performance  measurement  framework  for  the  purpose  of  validation 
research only will be discussed later. It is necessary to cover the more general case in order
to  understand  the  model-building  process. 

The general classification scheme in Figure 2 was developed by focusing only on those
variables that impact the quality of performance measurement. First, six criteria that have been
used to assess the quality of measures of job performance were identified. These are listed in
Table 1. It should be noted that the first criterion is properly labeled psychometric "effects,"
not "errors," which we feel is consistent with current thinking in the field of performance
measurement (Hakel, 1980; Hedge, 1982; Kavanagh, 1979).

Table 1. Quality of Performance Measurement Criteria

Psychometric effects: halo, leniency, range restriction
Inter-rater reliability
Content validity
Discriminability (in terms of individual performance levels)
Construct validity
Accuracy


Next, a literature search was conducted to identify the variables that impact these quality
criteria. The variables identified constitute the input variables shown in the first box on the
left side of Figure 2.

The process variables shown in the center of Figure 2 reflect the current thinking in the
performance measurement literature that these variables play an important and pervasive role in
the appraisal process (Borman, 1977; Dipboye & dePontbriand, 1981; Feldman, 1981; Hedge, 1982;
Murphy, 1982). There has been a recent emphasis on cognitive variables (their importance in the
decision-making process) (Feldman, 1981; Landy & Farr, 1980), as well as the acceptability/
confidence users have in the system (Dipboye & dePontbriand, 1981; Kavanagh & Hedge, 1983; Landy,
Barnes, & Murphy, 1978), and their hypothesized effects on measurement quality. In addition, the
motivation which the ratees bring to the appraisal process (DeCotiis & Petit, 1978) and their
trust in the appraisal process (Bernardin, Orban, & Carlyle, 1981) are considered important
process variables. Although there is little empirical evidence in the literature with respect to
the role of these variables, there are indications that they act as intervening process
variables. Thus, although these individual/system characteristics are known to influence
measurement quality, since they are hypothesized to be functionally related to both the
independent and dependent variables, they will be considered separately.

The cognitive variables have been placed outside the main causal path since these variables
may not always play an important role in the appraisal process. When the measurement system
relies heavily on human judgment, such as with ratings or trained observers, these variables
would be expected to influence measurement quality. However, when human judgment is not as
important, such as with productivity counts or number of absences, the impact of these variables
would be greatly reduced.

In Figure 2, the cognitive variables have been divided into two categories: (a) the input
and storage of information, which is primarily concerned with the observational heuristics that
people use when gathering information about an individual's job performance; and (b) the
cognitive processes that involve the judgment or decision heuristics that people use in assigning
a quantitative index to the performance of a person on the job. This division avoids the search
for a single cognitive variable, such as cognitive complexity (Bernardin & Cardy, 1981; Lahey &
Saal, 1981), that relates to measurement quality. Recent work by Murphy, Garcia, Kerkar, Martin,
and Balzer (1982) and Hedge (1982) indicates that considering observational processes and
decision processes separately may help to better explain their effects on the quality of the
measurement. This is also consistent with Wherry's theory of rating (Wherry & Bartlett, 1982),
which postulates that observation and recall by the rater are two separate components of the
observed score.

A general hypothesis underlying this conceptualization is that the more complex (i.e.,
sophisticated, not necessarily cumbersome) the observational and/or decision heuristics used, the
higher the quality of the performance measurement. However, individual/system characteristics
could affect the complexity of these cognitive processes, and thus lower or raise the quality of
the measures. A good example is the impact of organizational or unit norms in the current
military performance measurement system, where a strong norm exists to give enlisted personnel
high ratings (i.e., an "8" or a "9") on their performance evaluations. Regardless of the cause of
this norm, its effect is to simplify the rater's cognitive approach; whether or not the rater
uses complex observational heuristics, the decision heuristic is simple: "8" or "9." The
impact on the quality of ratings is obvious, and, interestingly, similar results have been found
in many non-military settings where ratings are used for administrative purposes.

In the performance appraisal literature, few studies have focused on motivation in the
context of performance measurement (Bernardin, Orban, & Carlyle, 1981; Bernardin & Cardy, 1982;
DeCotiis & Petit, 1978) or on trust in the appraisal process (Bernardin, Orban, & Carlyle,
1981). Still, the authors believe that these variables play key roles in the accuracy of
performance evaluations; thus, both have been included as elements of the classification scheme.

User acceptance of and confidence in the performance measurement system are seen as crucial
to the effective operation of the entire system, and thus as directly affecting the quality of
the measurement (Kavanagh, 1982b; Lawler, 1967). Some recent empirical work (Dipboye &
dePontbriand, 1981; Kavanagh & Hedge, 1983; Landy, Barnes-Farrell, & Cleveland, 1980; Landy,
Barnes, & Murphy, 1978) indicates that this is an important variable in a performance measurement
system. In our general conceptual framework of performance measurement quality, all of the system
characteristics indirectly affect the quality of the measurement through their impact on the
acceptability/confidence variable. Clearly, this acceptability variable may change in importance
depending on the purposes of the performance measurement. This notion is critical to the
development of a schema for validation purposes only, and will be discussed later in this section.

Another perspective used to generate the classification scheme was one borrowed from test
score theory. Spearman's classic test score model was selected because of its simplicity and
wide dissemination. The notion that an observed score, a performance measurement score, can be
divided into true and error components allows us to examine the impact of those variables that
affect true variance and those that affect only error variance in the performance measurement
situation. During our literature review, it became obvious that one would want to minimize those
factors that affect only error variance, while increasing the impact of those factors that
influence true variance. This finding has clear implications for future research strategies.
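
The decomposition described above can be illustrated with a short sketch. It assumes the
classical additive form (observed = true + error) with errors uncorrelated with true scores.
The function name `variance_components` and the four-person data set are hypothetical and were
chosen only so that the arithmetic is easy to follow; they are not taken from the report.

```python
from statistics import pvariance


def variance_components(true_scores, errors):
    """Split observed-score variance into true and error parts.

    Assumes observed = true + error for each person (classical test score model).
    """
    observed = [t + e for t, e in zip(true_scores, errors)]
    var_true = pvariance(true_scores)
    var_error = pvariance(errors)
    var_observed = pvariance(observed)
    return {
        "observed": var_observed,
        "true": var_true,
        "error": var_error,
        # Share of observed variance that is true variance.
        "true_share": var_true / var_observed,
    }


# Hypothetical data: errors are constructed to be uncorrelated with true scores.
true_scores = [2.0, 2.0, 4.0, 4.0]
errors = [1.0, -1.0, 1.0, -1.0]
components = variance_components(true_scores, errors)
print(components)
```

In this constructed example half of the observed variance is true variance. A variable that
adds only to the error component would lower that share while leaving the true variance
unchanged.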

This approach, based on test score theory, is analogous to that taken by Wherry (Wherry &
Bartlett, 1982), although we prefer to base our conceptual framework on the Analysis of Variance
model of test scores (Cronbach, Gleser, Nanda, & Rajaratnam, 1972). In Wherry's theory, the
observed rating a person receives is composed of the following components: true job performance
of the ratee, environmental influences, observation and recall by the rater, and the errors
associated with these factors as well as an overall error term. Although the test score model
used may not be critical, it can be seen from Figure 2 that the classification scheme is much
more specific about the variables that affect measurement quality.

Finally, in order to refine the classification scheme, our approach was to organize into
categories the many variables that are known to affect measurement quality. This provided a
framework for conducting the literature review in an organized fashion, and for identifying and
prioritizing AFHRL research needs. This reasonably exhaustive list of variables is contained in
Table 2.

As mentioned earlier, perhaps the most critical input variable in terms of its impact on
rating quality is the measurement purpose. For example, if a measurement system is to be used
for promotion or pay increases, it creates an entirely different context for the quality of the
measurement than if the system is for validation research purposes. That is, the pay-performance
relationship with measurement quality would be extremely important in a measurement system being
used for administrative purposes, but would have little effect for validation research purposes.

Performance measurement systems have four major purposes or uses: (a) for administrative
decisions, (b) for employee growth and development, (c) for validation research, and (d) for
meeting legal guidelines. The strength of the relationships between the individual/system
characteristics and measurement quality will change as a function of changing the purpose.
Although these effects have been discussed for some time (cf. Cummings & Schwab, 1973), only
recently has there been empirical evidence which demonstrates that measurement quality is
affected by the purpose of the measurement (Zedeck & Cascio, 1982). A good analogy would be to
consider the individual/system characteristics in Figure 2 as independent variables, the
intervening variables as


Table 2. Variables That Can Impact on Measurement Quality

1. Individual characteristics
    a. Cognitive variables: rater or ratee
    b. Rater/ratee intelligence
    c. Rater/ratee knowledge of the job being evaluated
    d. Rater/ratee personal characteristics
    e. Rater/ratee interpersonal trust

2. Relationship between ratee and rater/observer
    a. Sex congruence
    b. Race congruence
    c. Job tenure together
    d. Age congruence
    e. Off-the-job relationship
    f. History of conflict or cooperation

3. Method/source of measurement
    a. Supervisor ratings
    b. Peer ratings
    c. Self ratings
    d. Subordinate ratings
    e. Assessment center (team) ratings
    f. Work samples/simulations
    g. Productivity records

4. Scale development
    a. Critical incidents used
    b. Based on job description/job requirements
    c. Employee participation
    d. Top management support during development

5. Rating scale characteristics
    a. Content of the scale
    b. Anchors versus no anchors
    c. Behaviors versus traits
    d. Format type
    e. Number of anchors/scale points
    f. Single versus multiple dimensions
    g. Scaling metric/approach

6. Performance standards/goals
    a. Present or not
    b. Standards versus goals
    c. Participatively set and communicated
    d. Specificity of behavior or accomplishment expected

7. Social context
    a. Performance level of others in work group
    b. Existence of group norms
    c. Rater's status in group
    d. Ratee's status in group

8. Non-work variables
    a. Marital status
    b. Dependent status
    c. Dual-career family
    d. Participation in company activities off the job
    e. Stressful life events in recent past

9. Performance constraints
    a. Poor information
    b. Equipment efficiency
    c. Supplies deficiency
    d. Time limitations
    e. Poor work environment

10. Organizational/unit norms
    a. Expectation of certain level of performance by upper management
    b. Expectation by immediate supervisor regarding level of performance
    c. Presence of a union
    d. Pay/rewards tied to performance levels by contract
    e. Pay/rewards tied to performance levels by informal norms

11. Public relations/administrative procedures
    a. Required or not
    b. Mode of presentation
    c. Content of procedures

12. Training
    a. Content of training
    b. Format of training
    c. Length of training

13. Measurement purpose
    a. Validation research only
    b. Employee growth and development
    c. Administrative decisions
    d. To meet legal guidelines

14. Performance feedback
    a. Required or not
    b. Sources of feedback
    c. Participative
    d. Clarity of feedback
    e. Frequency of feedback

15. Pay-performance relationship
    a. Are they related in the system?
    b. Equity of the relationship

moderators, and measurement quality as the dependent variable in a multiple regression equation,
and to expect the beta weights to change for the various terms in the equation as the purpose of
the measurement changes. Since the initial thrust of the AFHRL program is for validation
research only, we will now examine what happens to the general framework in Figure 2 when only
this purpose is considered.
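
One way to write this analogy down, offered only as a sketch, is shown below. The symbols are
assumptions introduced here and do not appear in the report: Q is measurement quality, X_j are
the individual/system characteristics, M_k are the intervening variables treated as moderators
(and therefore entered as product terms as well as main effects), and the superscript (p)
indexes the measurement purpose.

```latex
Q = \beta_0^{(p)}
  + \sum_{j} \beta_j^{(p)} X_j
  + \sum_{k} \gamma_k^{(p)} M_k
  + \sum_{j} \sum_{k} \delta_{jk}^{(p)} X_j M_k
  + \varepsilon
```

Under this reading, changing the purpose p (for example, from administrative decisions to
validation research) changes the weights; a weight near zero corresponds to a variable that has
little effect on measurement quality for that purpose.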


2.3  Validation  Research  Classification  Scheme 

The  performance  measurement  classification  scheme  for  validation  research  is  depicted  in 
Figure  3.  It  should  be  understood  that  the  relationships  among  variables  are  the  same  as  those 
described  for  Figure  2.  However,  there  are  some  important  differences  between  the  two  figures. 
First, the measurement purpose input variable drops out, since Figure 3 is a classification
scheme for validation research only. Likewise, since research is the purpose, the performance
feedback and pay-performance variables are omitted from Figure 3. Also, it should be noted that
the acceptability of the appraisal system variable is now seen as less influential, based on the
logic that when job performance data are collected for validation purposes only, user acceptance
may not be as serious an issue. Thus, in Figure 3, while this variable is still central to the
performance appraisal process, its impact on measurement quality is likely to be reduced.
Although  the  variables  and  their  relationships  are  comparable  to  those  described  earlier,  there 
are  several  important  aspects  of  Figure  3  that  need  to  be  discussed. 


Figure  3.  A  Job  Performance  Measurement  Classification  Scheme  for  Validation  Research. 

First of all, the dependent variable, measurement quality, is something that has not been
clearly defined in the literature. Different researchers have used differing criteria to assess
measurement quality; six of these criteria were listed in Table 1. Of those listed, we view
accuracy and construct validity as the crucial criteria by which to judge the quality of the
measurement of job performance. The other four criteria are seen as important, but less
critical, in that satisfying their requirements does not guarantee the measure will be accurate
and construct valid. On the contrary, an accurate and construct valid measure will in all
likelihood satisfy the other criteria as well. This logic is consistent with current theory in
performance measurement (Nunnally, 1978) and performance ratings (Wherry & Bartlett, 1982).


Few  models  of  the  performance  appraisal  process  exist  in  the  literature.  Of  these,  only  two 
theoretical  approaches  were  found  that  emphasized  accuracy  as  the  crucial  criterion  of 
measurement quality. One such theory was advanced by DeCotiis and Petit (1978), who argued that
accuracy in ratings is a function of rater motivation, rater ability, and the availability of
appropriate rating standards. Although most of the variables in their model also appear in
Figure 3, their emphasis is on how the ratings are made, whereas ours is on the accuracy of the
measure.  Another  difference  is  that  the  DeCotiis  and  Petit  model  concerns  only  ratings,  whereas 
our  conceptual  framework  encompasses  any  measure  that  is  used  to  assess  individual  job 
performance. 

Wherry  and  Bartlett  (1982)  also  provided  a  model  of  the  performance  appraisal  process,  but  as 
with  the  DeCotiis  and  Petit  (1978)  model,  the  model  by  Wherry  and  Bartlett  is  concerned  only  with 
ratings.  Thus,  the  present  schema  is  more  comprehensive  than  either  of  the  other  two.  However, 
as  earlier  noted,  there  is  considerable  similarity  between  the  Wherry  and  Bartlett  model  and  the 
present  schema  in  terms  of  their  bases  in  test  theory. 

Of  the  input  variables  shown  in  Figure  3,  one  that  is  critically  important  is  the  measurement 
method. As with the measurement purpose variable described earlier, constraints in terms of
different  methods  will  likely  affect  the  relationships  within  that  framework.  If  this  is  so,  it 
may  be  that  different  measurement  methods  are  capturing  different  parts  of  the  performance 
criterion  space.  That  is,  supervisory  ratings  may  well  be  assessing  a  different  portion  of  the 
total  job  performance  criterion  space  than  are  peer  ratings,  self  ratings,  work  sample  tests,  or 
objective  indices  of  productivity. 

This  is  not  meant  to  imply  that  there  is  no  overlap  among  these  methods  in  the  part  of  the 
criterion  space  they  measure;  however,  they  are  perhaps  measuring  some  unique  aspects  of  the 
criterion  space  that  have  been  treated  frequently  in  research  as  error.  In  the  typical  research 
paradigm  to  validate  multiple  measures  of  job  performance,  one  or  more  methods  have  been 
eliminated  because  of  low  intercorrelations  with  the  other  methods,  on  the  assumption  that  these 
low  correlations  were  the  result  of  error  in  the  measures.  In  our  schema,  we  conceptualize 
different  measurement  methods  as  measuring  different  parts  of  the  criterion  space  with  differing 
degrees  of  fidelity;  thus,  low  correlations  between  measures  may  not  indicate  error.  It  can  be 
argued  that  the  typical  validation  approach  (intercorrelations  among  methods)  may  not  be  the  best 
for  assessing  measurement  quality. 


This discussion raises the issue of what, in terms of performance dimensions, constitutes the
criterion space. Approximately 12 to 15 performance dimensions repeatedly appear in the
literature. These dimensions seem to fit into two general categories: technical competence
skills and job-relevant interpersonal skills. Although this may seem an oversimplification, it
is supported by factor analytic studies in which two factors, roughly representing these two
broad skill areas, have emerged (Borman, 1981; Borman, Mendel, Lammlein, & Rosse, 1981). Although
these two categories are, of course, multidimensional, viewing the criterion space in this manner
provides an effective way of communicating AFHRL research needs.


In terms of measuring job performance, the following five methods are the most frequently
used: (a) supervisory ratings, (b) peer ratings, (c) self ratings, (d) work samples, and (e)
objective indices of productivity. The first three are widely used and will be used in this
research effort. However, rather than using a traditional work sample methodology, an
alternative to this approach will be developed and tested.


The new methodology is called Walk-Through Performance Testing (WTPT); it is being developed
specifically for the R&D program at AFHRL. The WTPT methodology combines aspects of both
observer interviewing and work sampling but, in addition, is designed to overcome certain
limitations associated with the generic tasks used with work sampling. The method will be
developed by accessing the Air Force data base (see Christal, 1974) that contains information on
the tasks performed in enlisted specialties. These tasks will form the basic content of the
measurement scale. Test administrators will be trained to use these scales to evaluate effective
and ineffective performance on each of the tasks. The interviewers will examine the job
incumbent by asking the person to perform certain tasks or explain certain procedures concerning
the tasks for that job. They will then record the person's behavior or answers on a rating
checklist of tasks. The important characteristic of this method is that the job is being reduced
to its smallest parts at the task level, and will include not only a core set of tasks, but a
series of unique tasks as well. Thus, this method will examine job performance at the micro level.

It  is  believed  that  the  WTPT  method  will  assess,  with  a  high  degree  of  fidelity,  technical 
skills  and  competence  --  one-half  of  the  criterion  space.  In  fact,  walk-through  testing  may  be 
one  method  that  removes  the  interpersonal/social  aspects  of  the  job  situation.  However,  as 
currently  planned,  it  may  be  less  accurate  in  assessing  the  job-relevant  interpersonal  skills 
side  of  the  criterion  space.  Supervisory  ratings,  on  the  other  hand,  may  be  quite  good  at 
assessing  interpersonal  skills  but  not  very  accurate  in  measuring  technical  skills,  particularly 
if the job has had recent significant changes in technology, or if the supervisor has never had
direct  work  experience.  All  five  methods  measure  portions  of  the  criterion  space;  however,  they 
differ  in  their  fidelity  or  accuracy  of  measurement. 

This  does  not  mean  that  one  or  more  of  the  methods  could  not  be  modified  to  assess  both  major 
parts  of  the  job  performance  criterion  space.  The  WTPT  method  could  be  modified,  for  example,  to 
measure  interpersonal  skills.  However,  this  modification  may  not  be  cost  effective  if  there  is 
another  method  that  can  accurately  assess  the  interpersonal  skills  without  modification  (e.g., 
peer  ratings).  In  summary,  the  five  methods,  as  currently  used,  assess  different  parts  of  the 
criterion  space  with  differing  degrees  of  accuracy.  AFHRL's  criterion  development  research  must 
be designed with this point in mind.

This  approach  to  job  performance  measurement  makes  the  typical  multimethod  validation  study 
problematic.  Typically,  if  job  performance  is  measured  with  two  or  more  methods,  and  zero-order 
correlations  are  calculated  among  the  methods,  those  methods  showing  nonsignificant  values  are 
rejected.  However,  if,  as  has  been  argued,  the  methods  are  not  assessing  the  same  portions  of 
the  criterion  space  with  equal  fidelity,  then  there  is  no  reason  to  expect  them  to  be  correlated. 

An extension of this logic leads directly to the idea of specifying the construct space for
job performance in terms of what Cronbach and Meehl (1955) have termed a nomological network (a
network of relations that are tied to observables and hence are empirically testable). In this
framework the measures are the observables, and the construct is used to account for
relationships among them. This suggests the use of Nunnally's (1978) technique for construct
validation; namely, that of testing the a priori hypothesized relationships within a construct
space with empirical data. To do this, the two major parts of the criterion space, technical
competence and interpersonal skills, must be better specified in terms of their job performance
dimensions.

Having delineated a multidimension-multimethod matrix, the next step in this research
strategy would be to hypothesize the expected level of relationship between each method-dimension
and all others. In this manner, one would have specified, a priori, the hypothesized nomological
net for these methods and the criterion space. After collecting data, the results would then be
examined to verify the expected correlations. In this strategy, a zero relationship would be as
important as a non-zero one in establishing the construct validity of the methods of measurement.
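
The strategy just described can be sketched in a few lines of code. This is a minimal
illustration under stated assumptions, not the procedure the program will use: the helper names
(`pearson`, `compare_to_hypotheses`), the method-dimension labels, the scores for five
incumbents, the hypothesized correlations, and the tolerance of 0.2 are all hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Zero-order (Pearson) correlation between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(ss_x * ss_y)


def compare_to_hypotheses(scores, hypothesized, tolerance=0.2):
    """Check each a priori expected correlation against the observed one.

    Returns a list of (pair, expected, observed, confirmed) tuples, where
    `confirmed` is True when |observed - expected| <= tolerance.
    """
    results = []
    for (first, second), expected in hypothesized.items():
        observed = pearson(scores[first], scores[second])
        confirmed = abs(observed - expected) <= tolerance
        results.append(((first, second), expected, round(observed, 2), confirmed))
    return results


# Hypothetical method-dimension scores for five job incumbents.
scores = {
    "wtpt_technical": [1, 2, 3, 4, 5],
    "supervisor_technical": [2, 1, 4, 3, 5],
    "peer_interpersonal": [2, 4, 3, 4, 2],
}

# Hypothetical nomological net, stated before looking at the data.
hypothesized = {
    ("wtpt_technical", "supervisor_technical"): 0.7,  # same dimension: expect high
    ("wtpt_technical", "peer_interpersonal"): 0.0,    # different dimension: expect zero
}

results = compare_to_hypotheses(scores, hypothesized)
for row in results:
    print(row)
```

In this constructed example both hypotheses are confirmed, including the expected zero
correlation, which counts as supporting evidence rather than as a sign of measurement error.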

For the Air Force R&D program, empirical construct validation cannot be accomplished until at
least the fifth or sixth year of the effort. First, the methods must be properly researched and
refined. There is a tremendous amount of research to be done during the first 4 or 5 years of
the program before this type of study can be conducted.


2.4  Additional  Considerations  for  the  AFHRL  Program 

Among  the  research  issues  to  be  investigated  prior  to  construct  validation  are  two  major 
considerations.  The  first  of  these  concerns  the  WTPT  methodology  for  assessing  job  performance. 
The second concerns the incorporation into the schema of an additional variable, time to
proficiency. 

As  described  earlier,  WTPT  is  a  combination  of  work  sampling,  simulation,  and  interviewing, 
which  assesses  job  performance  at  the  micro-task  level.  When  fully  developed,  it  is  expected  to 
have  the  highest  fidelity  for  assessing  some  aspects  of  job  performance.  Thus,  it  can  serve  as  a 
standard  against  which  to  judge  the  appropriateness  of  other  methods  for  evaluating  technical 
competence. 

A  research  strategy  underlying  the  AFHRL  program  might  be  referred  to  as  successive 
approximations to high-fidelity measures of job performance. As currently envisioned, the WTPT
method  should  accurately  assess  the  technical  job  skills  of  individuals.  However,  this  method  is 
quite  time-consuming  and  expensive,  particularly  for  large-scale  data  collection  across  the  Armed 
Forces.  If  research  proves  that  WTPT  does  indeed  have  high  fidelity  for  technical  skills,  we  can 
then  determine  which  of  the  less  expensive  and  time-consuming  methods  of  data  collection  on  job 
performance  most  closely  approximate  the  WTPT,  and  can  be  used  instead.  Also,  as  earlier  noted, 
if  the  WTPT  method  can  be  modified  to  accurately  measure  interpersonal  skills,  the  same  research 
strategy  of  successive  approximation  with  less  costly  and  time-consuming  methods  can  be  used  to 
tap  this  portion  of  the  job  criterion  space. 

Secondly, an additional variable, time to proficiency on job tasks, needs to be incorporated
into the R&D program. Research is necessary to determine how to best adapt the five methods to
measure this crucial part of job performance. A wide range of individual differences on this
variable likely exists among newly assigned personnel, particularly in their first job in the
military. Furthermore, it is likely that one or more of the methods can measure this variable
across task performance with greater degrees of fidelity/accuracy. In fact, there probably is a
task/dimension by method effect on the accuracy of assessing time to proficiency. The important
point is that this variable must be considered in any research effort.

In this chapter, the development of a conceptually based classification scheme of performance
measurement quality for validation research was described. This scheme, depicted in Figure 3,
will be used to summarize and organize previous research as well as to specify needed future
research. If one draws a line between any of the variables (either input or intervening process
variables) and the dependent variable, a linkage in the schema is defined. In Chapter III, the
literature will be categorized according to the appropriate linkage in the schema, and
conclusions regarding the known empirical "facts" within each linkage will be drawn. This will
allow a specification of what research is still needed for effective criterion development in
validation work (Chapter IV).

III. LITERATURE REVIEW

3.1 Introduction

As previously mentioned, an extensive literature search was conducted, resulting in a
voluminous collection of citations and abstracts. The review was concerned only with literature
on the assessment of the quality of job performance measurement systems.

The focus of the review was on behavioral literature, preferably empirical research,
concerned with investigating the hypothesized linkages in the schema in Figure 3. This was
accomplished to identify well-documented "facts," which could form the basis of prescriptive
advice for the AFHRL R&D program. That is, when we prescribe that a given system characteristic
must be included in the measurement methodology, there is no need for additional research by the
AFHRL to "re-establish" this finding. On the other hand, the literature review also identified
where research is needed for the AFHRL program.

Although  the  review  was  limited  to  literature  on  measurement  system  quality  within  the 
performance  appraisal  field,  the  entire  scope  of  available  writings  was  considered.  In  some 
cases, the search was simplified by identifying previously published reviews (Kane & Lawler,
1979; Kavanagh, 1971; Landy & Farr, 1980; Lewin & Zwany, 1976; Mabe & West, 1982; Schmidt &
Kaplan, 1971; Smith, 1976). Some of the literature covered by these reviews was not relevant to
the  purposes  of  this  report,  but  where  it  was,  the  review  was  used  as  a  secondary  source. 


3.2  Individual  Characteristics  and  Measurement  Quality 

This linkage concerns the differences between raters or observers (in the WTPT method) that
can  impact  on  the  accuracy  of  the  measures  of  individual  job  performance. 

Eighteen studies were identified that examined relationships between rater/observer
characteristics and measurement quality. Two studies (Borman, 1977; Mullins & Force, 1962) found
consistency in rating accuracy across two different jobs and two different job performance
dimensions, respectively. Murphy, Garcia, Kerkar, Martin, and Balzer (1982) found that accuracy
in observing behaviors is related to accuracy in evaluating performance; however, they noted that
this relationship may be more complex than their results showed. This study will be examined
more closely when rater cognitive variables are discussed. It does appear, nonetheless, that
there are important individual differences in raters that affect their rating accuracy.

The major study investigating the relationship between individual differences and rating
accuracy was done by Borman (1979a). Borman found that 12 personal characteristics correlated
significantly with rating accuracy. The most consistently high correlations were between
accuracy and (a) intelligence, (b) personal adjustment, and (c) detail orientation. It is
important to note that the variance in accuracy accounted for by all 12 traits was 17%,
suggesting that individual differences play a significant role in determining rating accuracy.
In a meta-analysis of self-evaluation studies, Mabe and West (1982) found that intelligence,
achievement motivation, and internal locus of control were associated with accuracy in
self-evaluations. Other research has found that rating quality is related to carefulness and
decisiveness of the rater (Mullins, Seidling, Wilbourn, & Earles, 1979), learned associations
among sets of behavioral and personality descriptors (Hakel, 1974), and interpersonal trust
(Kavanagh, Vance, & Wright, 1982). Clearly, the individual traits and characteristics of a rater
are related to measurement quality.

In a slightly different vein, two recent studies investigated individual differences derived
from laboratory data versus those derived from field data. In a study of police supervisors,
Kavanagh et al. (1982) found no statistically significant relationships between accuracy,
leniency, halo, and range restriction based on ratings of videotapes with true scores (Borman,
1979b) gathered in the laboratory and leniency, halo, and range restriction of ratings used
operationally in the field. Hedge and Kavanagh (1983), using the same approach, found only one
significant relationship, that between halo in the laboratory data and halo in the field data.
These unexpected findings need further replication, and it may be that other individual
characteristics of raters are moderating these relationships.
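
The quality criteria named in these studies (leniency, halo, range restriction, and accuracy against true scores) have been operationalized in several ways, and the cited authors do not all use the same formulas. As a minimal sketch only, assuming one common set of definitions and a hypothetical function name `rating_indices`, the indices for one rater could be computed as follows:

```python
from statistics import mean, pstdev


def pearson(x, y):
    """Pearson correlation of two equal-length score lists (neither may be constant)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def rating_indices(ratings, true_scores, scale_midpoint):
    """Illustrative indices for one rater.

    ratings, true_scores: one row per ratee, one column per performance dimension.
    The definitions below are assumptions chosen for illustration, not the
    formulas used in any particular study cited in the text.
    """
    flat = [v for row in ratings for v in row]
    flat_true = [v for row in true_scores for v in row]
    n_dims = len(ratings[0])
    return {
        # Leniency: how far the rater's average sits above the scale midpoint.
        "leniency": mean(flat) - scale_midpoint,
        # Halo: average spread across dimensions within a ratee (smaller = more halo).
        "halo_spread": mean(pstdev(row) for row in ratings),
        # Range restriction: average spread across ratees within a dimension
        # (smaller = more restriction).
        "range_sd": mean(pstdev([row[d] for row in ratings]) for d in range(n_dims)),
        # Accuracy: agreement between the ratings and the true scores.
        "accuracy": pearson(flat, flat_true),
    }


ratings = [[5, 5, 5], [6, 6, 6], [7, 7, 7]]      # one rater, three ratees, three dimensions
true_scores = [[3, 5, 4], [4, 6, 5], [5, 7, 6]]  # expert-derived true scores
print(rating_indices(ratings, true_scores, scale_midpoint=4))
```

In this made-up example the rater is lenient (ratings average two points above the midpoint) and shows complete halo (no differentiation among dimensions for any ratee), yet still orders the ratees correctly, which illustrates why the psychometric error indices and accuracy need not move together.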

Recent literature (Feldman, 1981; Landy & Farr, 1980; Wherry & Bartlett, 1982) has called
attention to the cognitive processes involved in evaluating others. Using halo, leniency, range
restriction, and rater confidence as criteria, Schneier (1977) found that cognitively complex raters were
more  confident  in  making  their  ratings  and  also  made  fewer  leniency  and  range  restriction 
errors.  Using  the  same  dependent  variables,  two  subsequent  studies  (Lahey  &  Saal,  1981;  Sauser  & 
Pond,  1981)  found  no  support  for  a  relationship  between  cognitive  complexity  and  measurement 
quality.  Finally,  in  three  further  studies  using  both  the  typical  psychometric  errors  and 
accuracy  as  the  measurement  quality  criteria,  Bernardin,  Cardy,  and  Carlyle  (1982)  found  no 
support  for  a  hypothesized  relationship  with  cognitive  complexity.  It  seems  clear  that  cognitive 
complexity,  as  measured  in  these  studies,  is  not  related  to  measurement  quality. 

Does  this  mean  that  the  cognitive  processes  of  the  rater  do  not  affect  the  quality  of  the 
measurement?  In  a  simplistic  approach  using  a  single  trait  such  as  cognitive  complexity,  the 
answer  is  probably  "yes."  However,  raters  use  multiple  cognitive  processes  (Landy  &  Farr,  1980; 
Wherry  &  Bartlett,  1982)  in  arriving  at  a  specific  judgment,  and  the  literature  supports  studying 
these  processes  separately  (as  indicated  in  Figure  3)  in  terms  of  their  effects  on  measurement 
quality.  Hedge  (1982)  found  that  training  raters  in  either  observational  techniques  or 
decision-making  techniques  differentially  affected  psychometric  errors  and  rating  accuracy. 
Murphy, Martin, and Garcia (1982) found that ratings made 1 day after viewing videotapes with
true scores were more accurate than ratings made immediately following the showing of tapes, and
suggested  that  different  memory  processes  were  in  operation.  Related  to  our  model  in  Figure  3, 
it  may  be  that  the  immediate  ratings  involve  only  observational  heuristics,  whereas  the  delayed 
ratings  involve  decision  heuristics. 

Evidence  that  there  are  individual  differences  in  the  decision  processes  of  raters  comes  from 
several  studies.  Policy-capturing  studies  have  identified  raters  who  were  consistent  in  their 
rating  policies  (Hobson,  Mendel,  &  Gibson,  1981),  raters  who  used  linear  regression  models  in 
their  strategies  (Zedeck  &  Kafry,  1977),  and  raters  who  used  different  decision  strategies 
depending  on  the  purpose  of  the  measurement  (Zedeck  &  Cascio,  1982).  The  latter  study,  although 
supportive  of  our  model,  unfortunately  did  not  include  validation  research  as  one  of  its  purposes 
of  measurement.  Finally,  one  study  found  that  different  raters  (self,  peer,  and  supervisor)  used 
different aspects of performance in arriving at their evaluations of performance (Zammuto,
London, & Rowland, 1982). Taken together, these studies indicate that there are important
individual  differences  in  both  observational  and  decision  processes  that  can  affect  the  quality 
of  measurement.  Future  research  should  treat  these  two  sets  of  cognitive  variables  separately 
when  attempting  to  determine  their  effects  on  the  quality  of  performance  measures. 
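
Policy capturing, as used in the studies above, generally means regressing a rater's overall judgments on the dimension scores (cues) presented to that rater and reading the regression weights as the rater's implicit policy. The following is a minimal sketch under that assumption; the function names `solve_linear` and `capture_policy` and the data are hypothetical, and the cited studies may have used different estimation details.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("cues are collinear; weights are not identifiable")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def capture_policy(cues, overall):
    """Ordinary least-squares weights one rater implicitly gives each cue.

    cues: one row per rated profile, one column per performance dimension.
    overall: the rater's overall judgment for each profile.
    """
    rows = [[1.0] + [float(v) for v in row] for row in cues]  # leading 1.0 = intercept
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, overall)) for i in range(k)]
    coef = solve_linear(xtx, xty)
    predicted = [sum(c * v for c, v in zip(coef, r)) for r in rows]
    y_mean = sum(overall) / len(overall)
    ss_res = sum((y - p) ** 2 for y, p in zip(overall, predicted))
    ss_tot = sum((y - y_mean) ** 2 for y in overall)
    # r_squared indexes how consistently the rater applies the linear policy.
    return {"intercept": coef[0], "weights": coef[1:], "r_squared": 1 - ss_res / ss_tot}


# Hypothetical rater who weights dimension 1 twice as heavily as dimension 2.
cues = [[1, 2], [2, 1], [3, 3], [4, 1], [5, 4]]
overall = [1 + 0.5 * c1 + 0.25 * c2 for c1, c2 in cues]
print(capture_policy(cues, overall))
```

Comparing the captured weights across raters, or across measurement purposes for the same rater, is one way to look for the individual differences in decision processes discussed above; the proportion of variance explained serves as a rough index of policy consistency.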


3.3  Rater-Ratee  Relationship 

Much  of  the  research  reviewed  in  this  and  the  previous  section  also  applies  to  observers  in 
the  proposed  WTPT  method.  Within  the  performance  appraisal  context,  most  of  the  research  has 
been  done  on  raters;  however,  the  processes  that  WTPT  observers  go  through  are  quite  similar  in 
that  they  also  require  observation  and  judgment.  They  differ  in  that  observers  have  more 
specific,  job-relevant  events  to  observe,  and  judgments  occur  immediately  following  the 
observation  of  the  events;  whereas  raters  must  make  judgments  based  on  job  events  recalled  from 
some  previous  time  period.  Nevertheless,  it  is  probable  that  many  of  the  findings  for  raters 
will  generalize  to  observers.  Where  they  do  not,  or  when  special  considerations  must  be  made  for 
the role of the observer, it will be noted.

Research  on  the  rater/ratee-rating  quality  linkage  concerns  the  formal  and  informal 
relationships  between  raters  and  ratees,  as  well  as  the  influence  of  the  sex,  race,  and  age  of 
the  ratee  on  the  quality  of  the  measure.  Although  most  research  has  concentrated  on  the 
interaction of the rater's sex, race, and age with the sex, race, and age of the ratee, some
research documents the main effects of these variables. In the opinion of the authors, there is
always an interaction between raters and ratees on these biographical variables; however, not all
of the relevant research investigated the interactions.

Several  studies  examined  characteristics  of  the  relationship  between  raters  and  ratees.  The 
degree  of  responsibility  the  rater  had  over  the  ratee's  previous  performance  (Bazerman,  Beekun,  & 
Schoorman,  1982),  the  rater's  familiarity  with  the  ratee's  previous  performance  (Jackson  & 
Zedeck,  1982;  Scott  &  Hamner,  1975),  and  the  degree  of  acquaintance  between  rater  and  ratee 
(Freeberg,  1968)  have  all  been  shown  to  affect  the  quality  of  job  performance  measurement.  The 
degree of acquaintance variable is perhaps most interesting. It is axiomatic that the rater must
be at least somewhat knowledgeable about ratee performance to complete the rating. In fact, most
authors argue that the rater must have had the opportunity to observe job-relevant behaviors or
else the rating will contain error (e.g., see Borman, 1975). However, Stone (1970) has argued
that as the degree of acquaintance increases, the possibility of bias in terms of halo increases,
particularly if the rater and the ratee become friends. This logic is consistent with Wherry's
theory of rating (Wherry & Bartlett, 1982); however, it has not been directly tested in the
performance measurement domain.

Three studies (Freeberg, 1968; Gordon, 1972; Quinn, 1969) have partially addressed the issue
of rater-ratee acquaintance; however, none of these studies included the full range possible for
the acquaintance variable. Freeberg (1968) could not find any effect due to length of
acquaintance, and concluded that degree of acquaintance is only important for rating quality when
it provides greater opportunity to observe job-relevant behaviors. Quinn (1969) found no effect
for acquaintance, but the range of scores on length of acquaintance was not sufficiently large to
adequately test the hypothesis. Neither of these two studies included friendship as a variable
that could affect rating quality. Gordon (1972), in a study tangential to this issue, found
that the favorability of the rater's impression of the ratee did not affect the accuracy of the
ratings. Favorability of impression may be close to friendship; unfortunately, Gordon's study
was a laboratory simulation using college students as the subjects, thus limiting the
generalizability of the results. It seems clear that well-designed studies are needed to explore
the acquaintance/friendship variable.

Many studies have investigated the effects of differences in the sex, race, or age of both
raters and ratees on the quality of performance measures. In such research, it is assumed that
these biographical variables are causing error in the measurement, but few studies have used
accuracy as the criterion as defined in this report. Evidence of bias, rather, comes from level
differences that indicate certain groups are receiving lower ratings on the basis of their
demographic characteristics.

Only two studies examined the influence of age on performance measurement quality (Cleveland
& Landy, 1981; Schwab & Heneman, 1978). For ratings of "paper people" vignettes of four
secretaries varying in age, Schwab and Heneman (1978) found no age-associated differences in the
accuracy of ratings. Cleveland and Landy (1981), examining ratings from 513 exempt managers,
found that older workers were rated lower on two performance dimensions, self-development and
interpersonal skills. They replicated these findings on a second sample. It is not clear,
however, whether Cleveland and Landy's results indicate differences in accuracy for ratings of
older workers, or actual differences in the performance of the older workers in the sample.
Finally, it should be noted that these two studies did not examine the interaction with rater
age, and its effect on measurement quality.

Research results related to sex effects on performance measurement quality have been
inconsistent. Two studies (Bigoness, 1976; Hamner, Kim, Baird, & Bigoness, 1974) found that
females were rated higher than males, and this effect was accentuated when they were performing
at high (versus low) levels. However, both were laboratory studies using videotapes and
employing students as raters. In two field studies (Cascio & Phillips, 1979; Mobley, 1982), sex
was found to have little effect on performance measures. Although females were rated higher than
males in the latter study, neither study found any effect of sex differences on personnel
decisions. In another laboratory study, Lee and Alvares (1977) found no differences in ratings
as  a  function  of  sex.  Feild  and  Holley  (1977),  in  a  field  study,  found  that  sex  differences 
existed  within  specific  job  classes,  but  one  could  not  generalize  these  differences  across  all 
job  classes.  Consistent  with  this  finding,  Nieva  and  Gutek  (1980),  in  a  review  of  the  empirical 
literature,  concluded  that  evaluation  bias  by  sex  does  exist,  but  its  effects  are  not  consistent 
across  all  situations.  Their  more  specific  conclusions  indicated  that  the  degree  of  bias  is  a 
function  of  the  level  of  inference  required  in  the  rating,  sex-role  incongruency  with  the  job, 
and  level  of  qualification  or  performance.  Of  particular  importance  for  this  review  are  the 
effects  of  sex-role  incongruency  on  the  accuracy  of  performance  measures  collected  for  validation 
purposes.  The  level  of  qualifications  or  of  performance  in  a  job  can  only  serve  to  increase 
sex-role incongruency; thus, it should also be included in research on this possible bias. The
level of inference required can be controlled by the scale development, and that will be
discussed  later  in  this  section. 

Results  on  race  are  also  somewhat  inconsistent.  Various  studies  have  found  significant  bias 
effects  due  to  race  (Bigoness,  1976;  Hall  &  Hall,  1976;  Hamner  et  al.,  1974;  Feldman  &  Hilterman, 
1977;  London  &  Poplawski,  1976).  Other  studies  have  not  found  these  race  effects  (Bass  &  Turner, 
1973;  Cascio  &  Phillips,  1979;  Cascio  &  Valenzi,  1978;  Farr,  O'Leary,  &  Bartlett,  1971;  Greenhaus 
&  Gavin,  1972;  Huck  &  Bray,  1976;  Mobley,  1982;  Moses  &  Boehm,  1975).  It  is  interesting  to  note 
that  these  latter  studies  were  done  in  field  settings,  whereas  the  former  were  done  in  the 
laboratory.  Wendelken  and  Inn  (1981)  noted  this  distinction,  and  conducted  a  major  field  study 
investigating  race  main  effects  and  the  interaction  between  raters  and  ratees.  They  found 
significant  effects  for  ratee  race,  rater  race,  rater-ratee  interaction,  and  past  performance  of 
the ratee; however, these four effects accounted for only 45% of the variance in the ratings. As
a  result,  they  suggested  that  the  larger  effects  found  in  the  laboratory  are  artifacts.  It 
appears  that  race  of  rater  or  ratee,  or  the  interaction,  may  well  affect  the  measurement  quality 
of  ratings  collected  for  validation  purposes;  but,  the  effect  is  expected  to  be  small.  To  the 
extent  that  this  variable  may  impact  on  the  accuracy  of  the  measures,  it  needs  to  be  investigated. 

The impact of other variables involved in the rater-ratee relationship on measurement quality
has also been demonstrated. Drory and Ben-Porat (1980) found that leadership style,
specifically initiating structure, was related to leniency on task-related dimensions in rating
subordinates. Cascio and Valenzi (1977) found that rater-ratee experience and education both
affected ratings significantly, but they also concluded that the results were of little
practical significance. Beatty, Schneier, and Beatty (1977) found that ratees perceived
desired job behaviors as occurring more often and undesired behaviors as occurring less often than
did raters. Finally, Barrett (1964) found that supervisors give high ratings to subordinates who
do their work the way the supervisor wants it done, regardless of what the job standards
indicate. All of these studies indicate the importance of the rater-ratee relationship in
determining the quality of ratings; however, no single finding appears important enough to merit
further individual study when performance measures are collected for validation purposes only.


3.4 Measurement Method


It is assumed that all of the measurement methods that are currently available in the
behavioral science literature suffer from criterion deficiency, some being worse than others
(e.g., production records or counts are probably worst). Further, it is posited that each of
these methods measures a part of the criterion space with more accuracy than other measures. For
example, peer ratings may be the most accurate method for assessing interpersonal skills that are
job relevant; yet when used to assess other parts of the criterion space, often all that is
measured is error variance. That is, the various measurement methods have frequently been used
to assess parts of the criterion space which they are ill suited to measure. This leads to two
conclusions. First, in research using multiple methods to assess individual job performance, the
empirical demonstration of a zero relationship is as important as a non-zero one in demonstrating
the construct validity of the measures. The second conclusion is that one must use multiple
methods of measurement to accurately assess the total criterion space.
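
The first conclusion can be illustrated with a small multitrait-multimethod style summary: correlations between different methods measuring the same part of the criterion space should be high (convergence), while correlations between measures of different parts should be near zero (discrimination). The sketch below is an illustration only; the function name `mtmm_summary`, the method and dimension labels, and the data are all hypothetical.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equal-length score lists (neither may be constant)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def mtmm_summary(scores):
    """Average convergent and discriminant correlations.

    scores maps (method, dimension) to a list of scores, one per job
    incumbent, with incumbents in the same order in every list.
    """
    keys = sorted(scores)
    convergent, discriminant = [], []
    for i, a in enumerate(keys):
        for b in keys[i + 1:]:
            r = pearson(scores[a], scores[b])
            if a[1] == b[1]:
                convergent.append(r)    # same dimension, different method
            else:
                discriminant.append(r)  # different dimensions, any method pairing
    return {"convergent": mean(convergent), "discriminant": mean(discriminant)}


# Made-up scores for five incumbents.
scores = {
    ("supervisor", "technical"): [1, 2, 3, 4, 5],
    ("peer", "technical"): [2, 3, 4, 5, 6],
    ("supervisor", "interpersonal"): [3, 1, 2, 1, 3],
    ("peer", "interpersonal"): [4, 2, 3, 2, 4],
}
print(mtmm_summary(scores))
```

In this contrived case the two methods agree perfectly within each dimension and the two dimensions are unrelated, so the near-zero discriminant average is evidence for construct validity rather than a failure of the measures.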

The industrial/organizational psychology literature tangentially supports these arguments.
Research comparing multiple sources (Baird, 1977; Bassett & Meyer, 1968; Blackburn & Clark, 1975;
Borman, 1974; Griffiths, 1975; Holzbach, 1978; Ilgen, Peterson, Martin, & Boeschen, 1981;
Kavanagh, MacKinney, & Wolins, 1971; Klimoski & London, 1974; Kraut, 1975; Meyer, 1980; Schneier &
Beatty, 1978; Thornton, 1980; Wiley & Hahn, 1977; Zammuto et al., 1982) has found (a) disagreement
among factor structures for different rating sources, (b) differences in rating strategies, and
(c) low discriminant validities for the measurement methods. On the basis of his results, Borman
(1974) argued that a "hybrid" rating system should be created in which raters make evaluations on
only those dimensions they are in a good position to rate. Likewise, Schneier (1977), in
reviewing the literature, concluded that it is erroneous to collapse ratings across raters, and
that the use of multiple sources could improve rating accuracy. Thus, there is support for the
argument that a measurement methodology for assessing individual job performance should include
multiple sources, with each measurement source measuring only that part of the criterion space
for which it has the highest fidelity.

There has been no research attempting to develop a multiple-method approach to the assessment
of job performance. As a first step, it will be necessary to determine, for each measurement
method, which part of the criterion space it can best measure. Though all measurement methods,
including production records, may be assessing some true variance in the criterion space, the
problem is the present uncertainty about which methods tap which parts of the criterion domain;
this should be the first research conducted. There has been good research done on the various
methods, and some hypotheses can be developed about their use in a multiple-method appraisal
system. Once each of the methods can be refined sufficiently and it can be determined that they
are assessing certain parts of the criterion space with a high degree of accuracy, the next step
should be to investigate the feasibility of developing and using a multiple-method approach for
the measurement of individual job performance.

Earlier,  we  listed  five  methods  of  performance  assessment.  It  should  be  noted  that  there  are 
other methods for the assessment of job performance; however, these are the methods that have
been  used  most  successfully  in  the  past. 

The evidence supporting supervisory ratings is considerable. In addition to the
multiple-source studies cited earlier, Landy and Farr (1980) provided a comprehensive review indicating
the acceptability of supervisory ratings. Supervisors are often in the best position to assess
the person's overall contribution to the effectiveness of the work unit. That is, the supervisor
may be best qualified to weigh the person's performance across the various parts of the criterion
space to reach an overall judgment.

Support for the use of peer ratings is also available in the literature (cf. Downey, Medland,
& Yates, 1976; Fiske & Cox, 1960; Kaufman & Johnson, 1974; Lewin & Zwany, 1976). Peer ratings
have had their best success at predicting future behavior, and it is hypothesized that this is so
because they accurately assess those job-relevant interpersonal "survival" skills that are
important for successful performance across jobs.

In addition to the multiple-method studies that have included self ratings of performance,
evidence supporting the usefulness of self ratings appeared in a recent review (Mabe & West,
1982). These authors concluded that, controlling for measurement conditions (e.g., specificity
of the measurement instrument, amount of prior self-evaluation experience), self ratings can
provide good measures of abilities.

The  fourth  method,  WTPT,  combines  interviewing  and  observing,  and  includes  elements  of  work 
sampling,  simulation,  and  work  observation  techniques.  There  is  strong  support  in  the  literature 
for  the  use  of  all  of  these  techniques  for  the  measurement  of  individual  job  performance  (Boehm, 
1982;  Hakel,  1982;  Robertson  &  Downs,  1979;  Smith,  1976;  Vineberg  &  Joyner,  1982).  As  discussed 
previously,  this  method  will  focus  on  measuring  the  smallest  possible  unit  of  job-relevant 
behaviors.  The  test  administrator  will  not  only  be  involved  in  the  observation  of  specific  job 
tasks,  but  will  also  ask  the  job  holder  to  describe  verbally  how  he  or  she  would  deal  with  a 
specific  job-related  task  or  problem.  Thus,  the  method  will  assess  job-related  skills,  and  also 
perhaps  interpersonal  skills,  and  even  supervisory  skills.  It  is  hypothesized  that  this  method, 
given  proper  development,  will  provide  the  most  accurate  measurement  of  the  technical  competence 
portion  of  the  criterion  space. 

The  fifth  method,  production  and  other  objective  records,  is  included  for  a  variety  of 
reasons.  First,  such  measures  are  usually  readily  available  and  are  commonly  used  as  indices  of 
the  effectiveness  of  units  and  organizations.  There  is  support  for  their  use  in  the  literature 
(Ronan & Prien, 1971), particularly in jobs that require production quality and quantity counts.
But  most  importantly,  they  can  be  used  to  capture  an  important  part  of  the  criterion  space. 
Typically,  these  types  of  measures  indicate  the  employee's  compliance  with  certain  work  or 
organizational rules. Violations of these rules, though infrequent, are what create the
measurement problem. However, aggregating these records in some form may provide important
information on individual work performance that is not being measured with the other methods.
Whether  these  individual  differences  in  work  behavior  represent  a  lack  of  compliance  or  other 
motivational  problems  remains  to  be  determined,  but  it  is  hypothesized  that  measuring  them  will 
improve  the  overall  assessment  of  the  criterion  space. 

The  research  directions  seem  clear.  If  the  measurement  methodology  used  for  validation 
research  is  to  assess  job  performance  with  minimal  criterion  deficiency  and  maximum  accuracy, 
then  research  to  develop  and  validate  each  of  these  methods  is  necessary.  Success  in  these 
efforts  will  lead  to  a  multiple-method  research  effort  concerned  with  establishing  the 
differential  accuracy  of  each  measure  for  portions  of  the  criterion  space. 

3.5  Performance  Standards,  Scale  Characteristics,  and  Scale  Development 

With  the  exception  of  the  production  records,  all  of  the  methods  require  scales  to  measure 
job  performance.  Since  three  linkages  in  the  classification  scheme  in  Figure  3  are  concerned 
with  the  measurement  scale  (i.e.,  performance  standards,  scale  characteristics,  and  scale 
development), the evidence for all three will be reviewed in this section.

Based  on  their  review  of  legal  cases  regarding  compliance  with  Equal  Employment  Opportunity 
Commission (EEOC) guidelines concerning performance appraisal forms, Cascio and Bernardin (1981)
argued that the performance appraisal form must have performance standards if it is to be in
compliance  with  the  guidelines.  Because  this  recommendation  is  for  operational  systems,  it  may 
not  apply  to  performance  measurement  used  only  for  validation  purposes.  However,  if  the 
existence  of  performance  standards  can  improve  the  accuracy  of  the  measurement,  then  they  should 
be  used  regardless  of  the  purpose  of  the  measurement.  There  are  arguments  for  the  use  of 
performance  standards  (Alewine,  1982;  Kirby,  1981;  Morano,  1979)  but  unfortunately,  no  empirical 
results  that  would  support  their  use.  Performance  standards  on  the  rating  format  may  make  the 
form  easier  to  use  by  raters/observers,  and  thus  enhance  its  accuracy;  however,  this  has  not  been 
tested  empirically,  and  remains  an  avenue  for  research. 


As  a  number  of  authors  have  noted,  the  search  for  the  single  best  scale  format  for  measuring 
job  performance  or  the  best  type  of  content  for  the  scale  has  resulted  in  no  conclusive  evidence 
(Kavanagh, 1971, 1982a; Kingstrom & Bass, 1981; Landy & Farr, 1980; Muczyk & Gable, 1981; Schwab,
Heneman, & DeCotiis, 1975). This does not mean that a researcher or a practitioner can be
cavalier  about  the  selection  of  a  format  or  other  scale  characteristics  when  developing  a 
performance  appraisal  scale.  In  fact,  as  all  of  the  cited  reviewers  have  noted,  care  in 
development  of  the  measurement  scales  is  more  important  for  the  quality  of  the  measure  than  is 
the  specific  format  or  content  chosen.  The  one  lesson  that  the  enormous  literature  on  scale 
development  and  scale  characteristics  has  demonstrated  is  that  there  are  right  and  wrong  ways  to 
develop  performance  measurement  scales.  This  section  will  focus  on  the  literature  supporting 
these  prescriptions  for  scale  development,  noting  where  research  may  be  needed  when  the  purpose 
of  measurement  is  for  validation  only. 

In  terms  of  scale  development,  it  seems  clear  that  the  scales  must  be  based,  in  some  way,  on 
job  descriptions  (Cascio  &  Bernardin,  1981)  and  job  task  requirements  (Cornelius,  Hakel,  & 
Sackett,  1979;  Rosinger  et  al.,  1982;  Tosti,  1979).  It  also  seems  clear  that  the  participation 
and  support  of  management  (Beer,  Ruh,  Dawson,  McCaa,  &  Kavanagh,  1978)  improves  the  quality  of 
the developmental process and thus, the quality of the measurement. Although it is quite common
to include both raters and job incumbents in the scale development process, the evidence
supporting this approach is meager and indirect (Friedman & Cornelius, 1976; Williams & Seiler,
1973). Most of the benefits derived from rater and employee participation are a result of their
increased acceptance of the system. For validation purposes, acceptance of the system is
probably crucial if one expects to obtain accurate measures. Thus, it appears that both employee
and rater participation in scale development may be necessary.

Regarding  scale  content  in  general,  there  seems  to  be  no  clear  resolution  as  to  whether  one 
should  use  personal  traits  or  performance  dimensions  only  (Kavanagh,  1971;  Massey,  Mullins,  & 
Earles,  1978).  However,  when  job  performance  is  the  target  domain  to  be  measured,  the  only 
conceptually appropriate content is performance dimensions. As Kavanagh (1982b) noted in a
review of some of the early research contrasting behaviorally anchored rating scales (BARS) with
non-anchored graphic rating scales (e.g., Borman & Dunnette, 1975; Burnaska & Hollman, 1974;
Campbell,  Dunnette,  Arvey,  &  Hellervik,  1973;  Keaveny  &  McGann,  1975;  Maas,  1965),  the  critical 
feature  of  the  rating  scale  is  how  clearly  the  performance  dimensions  are  described. 

In  the  multitude  of  studies  reviewed  concerning  development  of  alternative  forms  for  the 
measurement  of  job  performance  (Arvey  &  Hoyle,  1974;  Beatty  et  al.,  1977;  Bernardin,  1977; 
Bernardin, Alvares, & Cranny, 1976; Bernardin & Smith, 1981; DeCotiis, 1977; Dickinson & Tice,
1977; Fay & Latham, 1982; Finley, Osburn, Dubin, & Jeanneret, 1977; Githens & Elster, 1973;
Ivancevich, 1980; King, Hunter, & Schmidt, 1980; Latham & Wexley, 1977; Mullins, Weeks, &
Wilbourn, 1978; Nugent, Laabs, & Panell, 1982; Rizzo & Frank, 1977; Rosinger et al., 1982; Saal &
Landy, 1977; Schwartz, 1977; Seaton, 1974; Shapira & Shirom, 1980; Siegel, 1982; Zedeck, Kafry, &
Jacobs, 1976), several rather strong conclusions emerged. First, the anchors or descriptors that
define performance levels on job dimensions must be observable job behaviors or accomplishments.
Second, these observables must be related to job-relevant tasks. Third, the scale must be
structured such that the rater can use it easily. Fourth, the format used is not as important as
these other characteristics. Finally, if an overall measure of job effectiveness is to be
collected, it should occur at the end of the form, after individual dimensions of job performance
have been assessed.

Some experts (Kavanagh, 1980, 1982b; McAfee & Green, 1977) have argued that in selecting an
appraisal format, a broader class of criteria should be used in addition to the traditional
psychometric ones. McAfee and Green (1977) have contended that the various performance appraisal
formats available are differentially useful, depending on the purpose of the measurement. For
validation purposes, they suggested that direct indexes (objective or administrative measures
such as AWOL rate), weighted checklists, forced choice, or other variations on rating scales can
be appropriate. Kavanagh (1980, 1982b) listed 19 different criteria that could be important in
choosing a scaling format for the measurement of job performance. It seems important that these
additional criteria be carefully considered in the validation effort.

Also, at least one expert has shown that going beyond five scalar points does not increase
the reliability of the measures (Smith, 1976). However, there is also evidence (Finn, 1972) that
up to seven scalar points improves the quality of the measures in performance ratings. It seems
that the number of scalar points is not an important research issue as long as the scale contains
at least five. However, the decision to have more scalar points should not be based on
psychometric properties alone. There may be other important reasons, such as improving rater
acceptance of the form.

Two other important research issues remaining to be resolved concern scale content and the
scale development process itself.

The content issue involves such questions as: Are 10 to 12 performance dimensions adequate
to cover the job performance criterion space? Are there performance dimensions that are common
to all jobs within an Air Force Specialty (AFS)? There is some evidence that there may not be
universal dimensions for job performance (Feild & Holley, 1975), and that different raters may
define the job performance criterion space differently (Borman, 1974; Taylor & Wilsted, 1974;
Zedeck, Imparato, Krausz, & Oleno, 1974).

Based upon the literature and the authors' experience in building performance measurement
systems, several hypotheses have been formed concerning this content issue. First, many jobs
contain two general categories of performance dimensions: technical skills and job-relevant
personal and interpersonal skills. Second, there are a number of universal job dimensions that
are common to all enlisted jobs; for example, combat readiness and communications skills.
However, the way in which persons in different jobs perform their job-relevant tasks may be quite
different. To take a simple example, the communication skills required for effective job
performance as an aircrew mechanic would be decidedly different from those for a crew chief or a
clerical job. Third, the number of job dimensions required to cover adequately the job
performance criterion space is probably not more than 12 for non-supervisory positions and 15 for
supervisory positions. Obviously, the speculations voiced here need to be empirically verified.

Finally, a number of authors have raised questions about the techniques typically used to
develop rating scales (Bernardin & Kane, 1980; Dickinson & Tice, 1977; Kane & Bernardin, 1982;
Kavanagh & Duffy, 1978; Latham, Saari, & Fay, 1980; Schwab et al., 1975; Shapira & Shirom,
1980). The development of behavior-based rating scales starts with the generation of "critical
incidents" of job-relevant behaviors and accomplishments. These are usually collected from job
incumbents; however, they could be collected from supervisors, job knowledge experts, or peer
employees in parallel but different positions. When a performance measurement system is to be
used for validation purposes, a crucial research question is: Which of these sources of critical
incidents will lead to development of the most accurate measurement scale? As a second step,
behavior-based rating scale development typically involves a second set of judgments or a
statistical clustering to determine which are the best critical incidents and job performance
dimensions to put into the final rating form. Again, the research issue is which of these
various techniques will result in the most accurate performance measure for use in validation
research. These are important research issues for the rating and WTPT methods being considered
for AFHRL.


3.6 Environmental Context, Non-Work Variables, Performance Constraints,
and Organization/Unit Norms

These four classes of variables, as identified in Figure 3, will be considered here as a
single category since their combined effect generally results in increased error variance. These
variables are similar to the environmental influences term in Wherry's theory of rating (Wherry &
Bartlett, 1982); however, our approach has greater specificity, allowing for better
identification of research issues.

There is ample evidence that environmental and situational factors can affect the quality of
performance measures (Borman, 1978; McCall & DeVries, 1976; Schwab et al., 1975; Scott & Hamner,
1975; Turnage & Muchinsky, 1982). There is also evidence of the impact on measurement quality of
specific situational variables such as unit norms (Grey & Kipnis, 1976), situational constraints
(Peters, Fisher, & O'Connor, 1982; Peters & O'Connor, 1980; Peters, O'Connor, & Rudolf, 1980), and
social context (Knowlton & Mitchell, 1980; Mitchell & Liden, 1982; Wood & Mitchell, 1981). Among
the non-work variables, there is documented evidence that marital happiness and environmental
stressors such as duration on task directly impact job performance (e.g., Rose, Jenkins, & Hurst,
1978; Sharit & Salvendy, 1982; Wilkinson, 1969).

In  terms  of  non-work  variables,  environmental  stressors,  and  performance  constraints,  it  is 
important  to  determine  if  these  are  contributing  to  error  variance  in  the  measures.  Or, 
alternately,  are  raters  adjusting  their  judgments  because  they  are  aware  of  the  existence  of 
these  factors  and  their  temporary  negative  effect  on  the  person's  true  level  of  performance?  For 
example,  does  a  supervisor  adjust  his/her  ratings  of  the  job  performance  of  an  employee  because 
the  employee  has  just  had  a  death  in  the  family?  It  may  be  that  instructions  or  training 
emphasizing  such  factors  may  be  sufficient  to  minimize  their  potential  contribution  to  error 
variance in the measures. This will be discussed more fully in a later section; however, it is
apparent that these factors must be considered in developing an accurate measurement system.

Some research must also be conducted to ensure that the final measurement system is not being
adversely affected by social context or the existence of unit or organizational norms. Because
of the importance of organizational and unit norms in the military, these variables should
receive particular attention.


3.7  Public  Relations  and  Administrative  Procedures 


The notion that confusing administrative procedures and instructions to raters will diminish
the quality of the measures can be inferred from the literature. It seems apparent that the
clarity of presentation and the administrative procedures for scales with observable job-relevant
behaviors have led to the relative success of this type of format. Two major efforts involving
the development, implementation, and evaluation of performance measurement systems (Beer et al.,
1978; Kavanagh, DeBiasi, Hedge, & Miller, 1983) provide indirect evidence of the importance of
these variables. In both efforts, the importance of a public relations program to "pre-sell" the
performance measurement system was emphasized. They also noted the importance of forms
management, clarity of instructions to raters, and ensuring the administrative procedures for the
measurement system were consistent with other personnel procedures in the organization. Although
the literature generally has paid little, if any, attention to these variables, it is believed
they are crucially important to the acceptance and accuracy of a performance measurement system
for validation purposes. Because the AFHRL system must, at some point, be used for large-scale
data collection, the importance of these variables is amplified.


Finally, as noted previously, it may be possible to minimize error variance by the use of
instructions or public relations. For example, a public relations program which emphasizes that
the performance measurement system is for validation research only, and that raters should
disregard the organizational norms that predispose raters to judge everyone an "8" or a "9," may
be effective in minimizing this potential source of error. Likewise, clear instructions to
raters to not let off-the-job factors influence their evaluations of employees may be sufficient
to control for this potential error source. It seems clear that this type of research will be
necessary if the performance measurement system is to generate accurate ratings.


3.8 Rater Training

This is one of the most important linkages in the classification scheme in Figure 3. Though
most of the research reviewed deals with rater training, the principles are equally applicable to
observer training in the WTPT method. However, observer training is not seen to have as many
research issues as rater training since observer training is much longer, more under the control
of the researcher, and can thus do a better job of eliminating problems that negatively affect
the quality of ratings.

Although rater training does not always significantly impact measurement quality, most
empirical studies show positive effects of training on both the traditional psychometric concerns
and accuracy (Bernardin, 1978; Bernardin & Pence, 1980; Bernardin & Walter, 1977; Borman, 1975,
1979b; Brown, 1968; Fay & Latham, 1982; Hedge, 1982; Ivancevich, 1979; Latham, Wexley, & Pursell,
1975; Levine & Butler, 1952; Sauser & Pond, 1981; Spool, 1978; Stockford & Bissell, 1949; Taylor
& Hastman, 1956; Thornton & Zorich, 1980; Warmke & Billings, 1979; Zedeck & Cascio, 1982).
Clearly, it is not necessary to conduct research to determine if rater training should be a part
of a performance measurement system; rather, the important research issues involve the specific
characteristics of the training.

Length of the training is a research issue. Although there is little evidence that strictly
applies to the accuracy criterion, the literature indicates that training sessions as short as 5
minutes can improve the quality of the ratings (Borman, 1975). Zedeck and Cascio (1982) trained
students over 5 contact hours using the typical "psychometric error" approach, and found no
effect on accuracy using the "paper people" approach. Hedge (1982), using real raters, found
significant effects on accuracy with a 2-hour training session. Thus, the issue of length of
training has not been resolved, and may be complicated by the type of training given.

In those studies that used the traditional psychometric effects as the criteria of
measurement quality, it appears that participative techniques (group discussion, videotaping,
role playing) are better than lecture only. Training raters to maintain diaries also appears to
improve measurement quality (Bernardin & Walter, 1977). In those studies that used accuracy as
the criterion of measurement quality, it appears that the traditional "psychometric error"
training approach did not have a positive effect (Hedge, 1982; Zedeck & Cascio, 1982).

However, training raters to be better observers (Hedge, 1982; Thornton & Zorich, 1980) or
better decision makers (Hedge, 1982) has been shown to have positive effects on the accuracy of
the measures. Though not extensive, the evidence certainly suggests an important research issue
for rater training.


A research issue that has received no attention is who should do the training. For
large-scale data collection in the Air Force, it may be impractical to use personnel
psychologists to train all raters. It may be possible to use senior enlisted personnel as
trainers at their respective bases with no loss in terms of the quality and effectiveness of the
training. This issue will obviously have critical implications for AFHRL's validation research
effort.

A final issue in rater training (and observer training) is possible "wash-out" effects of
training. Even with lengthy and intensive training sessions, raters return to making the same
errors they did before training (Ivancevich, 1979; Latham et al., 1975). There has been no
research on "booster" or refresher training of raters or observers, and this would also seem an
important research question.


3.9  Intervening  Process  Variables 

Of the five intervening variables shown in Figure 3, the two cognitive variables were
discussed earlier; thus, the discussion in this section will focus on the acceptability, trust,
and rater motivation issues. Research has shown that the acceptability of performance appraisals
and appraisal systems can significantly impact measurement quality (Dipboye & dePontbriand, 1981;
Kavanagh & Hedge, 1983; Landy et al., 1978). When a performance measurement system is being used
for validation research, user confidence that accurate judgments can be made about job
performance may affect measurement quality. This issue needs to be resolved through empirical
research.

Rater motivation has been essentially ignored by performance appraisal researchers. This may
be the result of a general belief that individual differences do not exist across raters in their
motivation to rate accurately. Although DeCotiis and Petit (1978) incorporated rater motivation
as an important part of their model of the appraisal process, they cited only Taft's (1971)
theory of interpersonal judgments as support for the inclusion of this variable in the model.
Recently, Bernardin and his colleagues (Bernardin & Cardy, 1982; Bernardin, Orban, & Carlyle,
1981) have focused on rater motivation, but only in terms of how it might be affected by the
level of trust a rater has in the appraisal system.

It is hypothesized that there are differences in motivation and trust in the appraisal system
across raters. In addition, it seems logical that appraisal accuracy may be affected by the
interaction of motivation or trust and such system characteristics as purpose of appraisal,
administrative procedures used, and appraisal format. For example, rater motivation to provide
accurate ratings may be higher when ratings are gathered for research purposes rather than for
administrative or developmental purposes. In addition, for a "research purposes only" system,
rater motivation to rate accurately may depend on the administrative procedures/public relations
used in implementing the system. These and other similar issues need to be resolved empirically.


IV.  RESEARCH  IMPLICATIONS 

As discussed in Chapter I, the major objective of the classification scheme and literature
review was to provide guidelines for specifying research needs in developing a performance
measurement system for validation research in the military. Implications for research will be of
two major types: (a) specification of areas where research is not needed; that is, where the
literature provides prescriptive advice on system characteristics; and (b) specification of areas
where further research is needed; that is, where the empirical evidence is inconclusive.

We have taken a conservative approach to accepting that a given system characteristic will
impact on accuracy, especially when evidence relates only to the impact of the other rating
quality indicators. On the other hand, where there are a number of studies examining a
particular system characteristic, and the results are consistent across studies, it will be
concluded that no further research is needed, even though accuracy was not used to assess
measurement quality. These recommendations will be based on the judgment that the empirical
evidence is strong enough to warrant generalizing to the accuracy criterion, and further, that
the cost of the research would be a waste of resources. In any case, we have recommended that
research be conducted in those important areas where we believe there to be inadequate empirical
evidence.


4.1  Individual  Characteristics 


It seems clear that rater/observer individual differences will affect measurement quality.
Some variables shown to be related to measurement quality are the personal characteristics of the
raters (Borman, 1979b), accuracy of the rater as an observer (Murphy, 1982), carefulness and
decisiveness (Mullins et al., 1979), learned associations among sets of behavioral and
personality descriptors (Hakel, 1974), and interpersonal trust (Kavanagh, Vance, & Wright,
1982). Although cognitive complexity, as commonly measured, is not related to measurement
quality (Bernardin & Cardy, 1981), this does not mean that cognitive processes are not; dividing
them into observational and decision processes, as done in the conceptual model, has some
support in the literature. The literature strongly supports the importance of cognitive
processes in performance measurement; and the work of Hedge (1982) on decision processes and
accuracy in ratings, as well as Murphy's (1982) work on observational processes and accuracy,
indicates these are avenues to pursue.

Not  all  of  the  research  directions  suggested  by  the  literature  are  relevant  for  the 
development  of  a  measurement  methodology  for  job  performance,  particularly  one  to  be  used  only 
for  validation  purposes.  For  example,  Borman's  (1980)  suggestion  that  we  should  select  raters  on 
individual differences that are related to rating quality is an intriguing research project, but
one  that  may  not  be  feasible  or  beneficial  for  the  Air  Force  research  project. 


A major research effort should be concerned with the relationship between the cognitive
processes (heuristics) involved in observing and judging. Research in this area should be
sequential. The first step would be to determine if different observation and decision processes
are differentially related to accuracy of performance ratings and/or observations. The next step
is to determine if rater/observer training can effectively teach raters/observers to use these
processes. This research effort could be quite important to the WTPT method. The work of Hedge
(1982) would seem a good starting point to explore both processes and their relationship to
accuracy.


Some of the earlier cited research on individual differences might be useful in selecting
both observations and judgments about human performance. To ensure high fidelity in the WTPT
method, it will be necessary to be sure that the instrument that records the data (the observer)
is a source of only minimal error. Thus, it might be quite appropriate to select observers.
However, selecting raters may be more difficult.


The study by Murphy (1982) relating observational accuracy to rating (judgment) accuracy,
although limited in terms of the sample used, has some interesting implications for research.
If, within the military setting using performance measurement for research purposes only, it can
be established that observational accuracy is strongly related to rating accuracy, then it may be
possible to use observation tests to evaluate how accurate raters are likely to be in assessing
performance.


Although evidence points to a number of individual differences that are important in
determining measurement quality, it would seem fruitless to investigate variables that are not
easily changed via training. This does not hold true for the selection of WTPT observers.
Observers may be selected according to scores on individual differences variables found to be
correlated with accuracy. In sum, individual characteristics may not be important for rating
methods (unless training can modify them); however, they may form the basis of selection of
observers in the WTPT method.


Other variables that are important to the accuracy of both the rater and the observer are
their knowledge of the job and the degree of acquaintance with the job incumbent. Rater/observer
understanding of the job is clearly important for rating quality. The degree of acquaintance
between rater and ratee will be discussed in the next section.


4.2  Rater-Ratee  Relationships 

Even when performance measurement is for research purposes only, rater-ratee relationships
would be expected to impact the quality of the measures.

From the literature review, it is apparent that the rater's familiarity with the ratee's
previous performance (Jackson & Zedeck, 1982; Scott & Hamner, 1975), as well as the rater's
acquaintance with the ratee (Freeberg, 1968), will impact the quality of the measurement. This
would seem to be one of the most important areas of needed research for both the WTPT method and
the peer and supervisory rating methods. For the WTPT process, it means that observers should
probably be selected who are not knowledgeable about ratees' previous performance. Degree of
acquaintance is an area needing research. It appears that observers need to become acquainted
with the ratee's job performance in terms of relevant behaviors and/or accomplishments. However,
as the literature indicates, too much acquaintance with the ratee can lead to bias in the ratings
(typically halo). The crucial research question for WTPT observers is how much behavior they
need to observe to make an accurate judgment about job performance. There would appear to be an
optimum point (or a range) in terms of the amount of information about job performance required
to make accurate evaluations. Too little information will lead to errors of deficiency; too much
information may lead to biases.

From the literature, it is known that sex, race, and age can impact on the quality of
judgmental data (e.g., Bigoness, 1976; Hamner et al., 1974; Nieva & Gutek, 1980). However, not
all laboratory or field studies found these effects (Cascio & Phillips, 1979); in field studies
such as those of Mobley (1982) and Wendelken & Inn (1981), only about 4% of the variance was due
to these effects. Whether the same effects will occur when the measurement is made for
validation purposes is unknown, but this is clearly a research issue. Possible ways to minimize
these effects are through training and through effective scale construction that focuses on
relevant performance behaviors. The latter approach seems an appropriate research area for the
Air Force.

For the WTPT method, it will be necessary to determine if there are sex, race, or age effects
on the quality of the evaluations. Further, the extent of any observer-ratee interaction on
these variables should be determined. Finally, for both the ratings and the WTPT method, it will
be necessary to determine if there are sex effects on measurement quality when there is sex role
incongruity, particularly for females in non-traditional job specialties.


4.3  Measurement  Method 


In this proposed program of research, recall that the focus is on the following measurement
methods of job performance: (a) supervisory ratings, (b) peer ratings, (c) self ratings, (d)
WTPT, and (e) objective indices of productivity. The first three are well known. WTPT, a new
method developed specifically for this project, combines both observer interviewing and hands-on
methodologies. Test items will be constructed to evaluate performance on the required tasks for
enlisted specialties. Test administrators, trained to use these scales, will examine job
incumbents by asking them to perform these tasks, or answer specific content questions concerning
the tasks. The job incumbent's behavior or answers will then be recorded on a rating checklist.
Thus, this method will examine job performance at a micro level.

As discussed earlier, it is believed that the performance criterion space for many jobs
consists of two major parts: technical competence and job-relevant interpersonal skills. Both
are necessary for effective job performance. However, no single measurement method is able to
accurately assess both parts equally well. In fact, it is hypothesized that the different
measurement methods assess different parts of the criterion space with varying degrees of
accuracy.

The first, and perhaps most critical, research task is to specify these hypothesized
relationships among the measurement methods and the dimensions of job performance within the
criterion space. This will involve the generation of an "expected" correlation matrix within the
multitrait-multimethod scheme. In line with various recommendations in the scientific literature
(e.g., Feldman, 1981; Nunnally, 1978) that a priori conceptualizations must drive research
programs, this is seen as a high priority task early in the research program; however, empirical
testing of this nomological net will not occur until much later in the research program.
Obviously, the measurement methods must be developed through research efforts such that there is
reasonable certainty that the methods are accurately assessing the portion of the criterion space
for which they are intended.
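
As an illustration of how a multitrait-multimethod matrix might later be summarized, the
following Python sketch groups correlations by whether two measures share a method or a
dimension. It is a minimal sketch only: the function names, the two methods, the two dimensions,
and the scores are hypothetical and are not taken from the research program described here.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length lists of scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def mtmm_summary(scores):
    """Average the correlations within each multitrait-multimethod block.

    `scores` maps a (method, dimension) pair to one score per ratee.
    """
    blocks = {
        "same_dim_diff_method": [],  # convergent evidence
        "diff_dim_same_method": [],  # inflated values suggest method variance
        "diff_dim_diff_method": [],  # baseline for discriminant evidence
    }
    for (key1, v1), (key2, v2) in combinations(scores.items(), 2):
        (m1, d1), (m2, d2) = key1, key2
        r = pearson(v1, v2)
        if d1 == d2 and m1 != m2:
            blocks["same_dim_diff_method"].append(r)
        elif d1 != d2 and m1 == m2:
            blocks["diff_dim_same_method"].append(r)
        else:
            blocks["diff_dim_diff_method"].append(r)
    return {name: sum(rs) / len(rs) for name, rs in blocks.items() if rs}


# Hypothetical scores for five ratees; values are invented for illustration.
scores = {
    ("supervisor", "technical"): [5, 3, 4, 2, 1],
    ("supervisor", "interpersonal"): [2, 4, 3, 5, 1],
    ("wtpt", "technical"): [5, 2, 4, 3, 1],
    ("wtpt", "interpersonal"): [1, 4, 3, 5, 2],
}
summary = mtmm_summary(scores)
```

An "expected" matrix of the kind described above would state, before data collection, which of
these blocks should show high or low values for each method-dimension pair; the observed summary
could then be compared against it.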

This research task would involve an intensive examination of the literature on measurement
methods and specific dimensions of performance. Although we covered much of the literature in
the present effort, our purpose was simply to identify those measurement methods that have been
used with success. A more intensive literature search would be needed to form hypotheses about
what specific portions of the criterion space are being accurately measured by the specific
methods.

The completion of a conceptual framework of interrelations among method-dimension measures
requires that a related research issue be investigated simultaneously. This issue involves
determining the content of the criterion space. For example, how many job dimensions constitute
this space, and is there agreement on a general set of performance dimensions that is common
across jobs within the military? Further, are there additional job dimensions that are common
within some job families but not common within other job families?

The  literature  and  past  experience  suggest  there  are  a  set  of  common  job  dimensions  that 
apply to most jobs, and that successful performance on these common dimensions probably changes
across  job  families  and  job  levels.  For  example,  the  job-relevant  social  skills  for  an  aircraft 
mechanic  would  be  expected  to  differ  from  those  for  clerical  specialties  and  for  supervisory 
jobs.  Although  communications  skills  are  required  in  each,  the  job  behaviors  and  performance 
standards  would  be  different  for  the  three  jobs.  Kavanagh  et  al.  (1983)  found  this  pattern  of 
similar  dimensions  of  performance  (with  differing  content  within  dimensions)  across  job  families 
in  developing  a  performance  measurement  system  for  a  multi-hospital  corporation.  It  is 
reasonable  to  expect  to  find  such  a  pattern  in  the  development  of  the  performance  measurement 
methodology  for  the  Air  Force.  This  research  effort  will  be  concerned  with  reviewing  the 
literature  to  identify  an  initial  set  of  common  job  dimensions  that  can  be  used  to  generate  a 
nomological  net  of  relationships  among  method-dimension  measures. 

It has been shown that each of the five methods being considered within this research program
accurately assesses some portion of the criterion space. Thus, none of these methods should be
abandoned early in the research program; rather, research should be aimed at determining how to
refine these methods so that they achieve a high degree of fidelity in measuring their respective
portions of the criterion space. It is important that these methods be evaluated within an
accuracy paradigm, whether that be via multitrait-multimethod, "paper people," or videotapes with
true scores. The traditional measures of measurement quality should also be collected; however,
all research within this program must use at least one accuracy measure to evaluate the quality
of measurement. This is not only important conceptually; it is crucial if one wants to
generalize results across the various research projects. The establishment of specific,
acceptable research paradigms using an accuracy criterion is critical to the success of this
entire program of research.
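
To make the idea of an accuracy measure concrete, the following Python sketch computes one
simple index: the mean absolute difference between a rater's ratings and known true scores, as
would be available with scripted videotapes or "paper people." This is an assumption-laden
illustration, not the program's prescribed index; the function name and all numbers are
hypothetical, and other accuracy indices could be substituted.

```python
def distance_accuracy(ratings, true_scores):
    """Mean absolute deviation of ratings from true scores.

    Zero means the rater reproduced every true score; larger is worse.
    """
    if len(ratings) != len(true_scores):
        raise ValueError("ratings and true_scores must have equal length")
    deviations = [abs(r - t) for r, t in zip(ratings, true_scores)]
    return sum(deviations) / len(deviations)


# Hypothetical true scores for four ratees on a 7-point scale.
true_scores = [6, 4, 2, 5]

# A rater who tracks performance closely and one who rates everyone high.
careful_rater = [6, 5, 2, 4]
lenient_rater = [7, 7, 6, 7]

careful_error = distance_accuracy(careful_rater, true_scores)
lenient_error = distance_accuracy(lenient_rater, true_scores)
```

Because the index is expressed in scale points, it can be compared across studies that share a
rating scale, which is the kind of generalization across projects the paragraph above calls for.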

Research on objective indices of individual job performance should proceed somewhat
differently than research on the other methods. Objective indices in this case refer to items
such as administrative counts (e.g., court martials, absences, inspection records, and accident
rates). There are several problems with these indices, however. They have low base rates,
suffer from criterion deficiency, are unreliable indices of true events, and typically show low
variance. This does not mean these indices should be ignored, because they may be capturing a
piece of the criterion space; however, no major research effort on these measures is warranted.
Perhaps when the other methods have been refined, these objective indices will become part of a
larger data collection effort and prove useful for comparison purposes. Finally, by closely
coordinating with an Army Research Institute (ARI) effort that is currently examining these
objective indices, AFHRL may save considerable time and effort spent on evaluating objective
measures.

A  good  organizational  plan  would  be  to  treat  the  five  measurement  methods  as  separate  streams 
of research, each with decision points to "stop," "modify," or "continue." Of course, there will
be a point in the research program where AFHRL will begin to integrate these methods into a
measurement methodology for job performance. Thus, one could envision these streams of research
as the rows in a matrix describing the total research program, with the columns representing
research  activities  or  tasks  within  the  streams.  One  could  then  establish  decision  points  within 
each  stream  as  to  the  quality  of  the  method. 


4.4  Measurement  Scale  Development 

The  following  prescriptive  guidance  exists  for  developing  rating  scales: 

1. Both raters and ratees should be involved in the development of the scale content.

2.  The  rating  scales  must  be  based  on  job  descriptions/requirements. 

3. A "critical incidents" approach should be used in initially identifying the relevant
observable behaviors and accomplishments that define differing levels of job performance. The
special strength of this technique is the high degree of content validity in the resultant scales.

4.  It  is  crucial  to  have  visible,  top  management  support  for  the  research  program. 


4.5 Scale Characteristics

This is the linkage in the classification scheme about which the most empirical research and
fairly  clear  prescriptive  advice  exist,  as  follows: 

1.  The  dimensions  of  performance  should  be  well  defined,  and  anchored  by  observable 
behaviors  or  accomplishments. 


2.  There  should  be  at  least  five  scale  points  for  the  rater/observer  to  make  a  judgment. 
Although  reliability  does  not  increase  much  beyond  five  scale  points,  having  a  larger  number  of 
points may increase raters' acceptance of the scale. It would probably be wise to avoid a
nine-point scale because of its similarity to the current operational measurement system. Thus,
a five- or seven-point scale should be used. It is also important that scales across the methods
use the same number of scalar points.

3.  Each  scalar  point  needs  to  be  anchored  with  observables. 

4.  The  type  of  format  used  is  not  as  crucial  as  the  other  characteristics  described  above. 

5.  An  overall  measure  of  effectiveness  should  be  collected,  probably  at  the  end  of  the  form, 
after  the  individual  dimensions  of  performance  have  been  assessed.  This  should  be  done  with  the 
rating methods and the WTPT method.

In  terms  of  needed  research,  the  content  issue  raised  in  the  "Measurement  Methods"  section 
should  be  a  priority.  Identification  of  common  dimensions  across  jobs  and  a  general  mapping  of 
the  criterion  space  for  Air  Force  AFSs  will  be  important.  Research,  in  coordination  with  an 
intensive  literature  search  to  identify  common  performance  dimensions,  should  help  to  better 
identify the needed content for the measurement scales. Other work within AFHRL, such as
research on skill, difficulty, and aptitude requirements for various AFSs, may be applicable for
identifying the performance dimensions needed to adequately measure the criterion space.
Coordination  of  these  research  efforts  should  remain  an  AFHRL  responsibility. 

4.6  Performance  Standards 

Performance standards are an important part of this measurement methodology. A performance
standard establishes a specific level, in terms of observables, at which a job incumbent must
perform to be classified "acceptable," "unsatisfactory," etc. Performance standards also have
important implications for scale development. It may be that we do not need performance
standards, since we are measuring for validation purposes only; the important thing here is to
accurately rank-order job incumbents according to their performance levels. Performance
standards are not needed to accomplish this. On the other hand, use of standards might increase
user acceptance of the scales. This is clearly an unresolved issue which needs some attention in
this  research  program. 
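
The point that rank-ordering does not depend on performance standards can be illustrated with a short sketch. It assumes hypothetical selection test scores and performance ratings for six incumbents; the data and function names are illustrative only and are not drawn from this program. The sketch suggests that a rank-order (Spearman) validity coefficient is unchanged when the rating scale is stretched by any strictly increasing transformation, so no cut points are required to compute it.

```python
def ranks(values):
    """Return 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman(x, y):
    """Spearman rank-order correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical data: selection test scores and 7-point performance ratings.
test_scores = [52, 61, 47, 70, 66, 58]
ratings = [3, 5, 2, 7, 4, 6]

# A strictly increasing rescaling of the ratings (assumed for illustration).
rescaled_ratings = [r ** 2 + 10 for r in ratings]

rho_original = spearman(test_scores, ratings)
rho_rescaled = spearman(test_scores, rescaled_ratings)
```

Because only the ordering of incumbents enters the coefficient, `rho_original` and `rho_rescaled` agree. This says nothing about user acceptance, which is the separate argument for standards raised above.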

For  the  walk-through  testing  method,  the  research  issue  is  how  scales  are  to  be  developed 
such that observers can make objective judgments about a person's level of performance on job
tasks. The issue becomes one of generating levels of performance on specific tasks that can be
observed and coded in an objective manner that will define differing levels of performance. This
appears to be close to the performance standards issue raised here, and the resolution may be
research  aimed  at  determining  which  group(s)  of  persons  (job  knowledge  experts,  etc.)  are  best 
qualified  to  provide  input  for  development  of  the  observer  scales  and  performance  standards. 


4.7  Social  Context 

From the limited research reviewed in this linkage of the schema (Grey & Kipnis, 1976;
Knowlton & Mitchell, 1980; Mitchell & Liden, 1982; Rood & Mitchell, 1981), it appears that social
context contributes to error variance. Since none of the studies involved validation research,
it may be that the effects of social context will not occur, or will be greatly minimized, in a
validation effort. Using the laboratory paradigm discussed earlier in relation to age, race, and
sex effects, the effects of social context could also be investigated. This would seem to be an
important research project for the Air Force program. If these effects are present in a
research-only context, then research must determine how to eliminate these errors when data are
collected in the field. Again, this may be accomplished by training, or more simply, by
instructions.

These social context effects may be most relevant to the peer rating
method, both with respect to the contrast effect and the biasing of job skill ratings by the
ratee's level of interpersonal skills. The contrast effect occurs when the rating of a ratee's
true performance level is altered because those with whom he/she works have performance levels
different from his/hers. For example, an "average" performer may be rated higher if other
persons in his/her work group are low performers. This has been found for supervisory ratings,
and may be even more powerful for peer ratings. Clearly, this is a research issue that needs
investigation. Another biasing effect occurs when an individual's performance evaluation in the
technical skills area is altered by the rater's appraisal of his/her interpersonal skills. Peer
ratings might be particularly subject to this bias, and research needs to be done to investigate
whether these effects will occur in a validation research situation.

Interpersonal  skills  may  also  impact  the  evaluation  of  technical  skills  in  the  WTPT  method. 
This  possibility  could  be  easily  addressed  in  a  laboratory  study.  By  creating  an  experimental 
situation where the ratees are approximately equal in job skills but quite different in their
interpersonal dealings with the observer, it would be possible to assess the impact of
interpersonal skills on the judgments of technical competence. If the WTPT method is to have the
highest  fidelity  of  all  methods  for  assessing  technical  skills,  then  there  should  be  research  to 
determine  if  the  interpersonal  interactions  between  the  job  incumbent  and  the  observer  are 
affecting  the  observer's  judgments.  If  this  is  so,  then  research  will  be  needed  to  help 
eliminate  these  errors. 


4.8 Non-Work Variables


Non-work variables include marital status, religion, pre-school children in the family, a
working spouse, and stress events. Based on the job-related stress literature, it seems clear
that these variables do affect job performance, but it is not at all clear that they influence
ratings of job performance. These variables tend to be taken into account by supervisors in
completing their ratings. That is, when there are non-work events that have a temporary effect
on an individual's job performance, raters usually adjust for these factors. One way to minimize
non-work factors is to instruct the raters to consider the person's performance over a longer
period  of  time.  Past  experience  indicates  that  this  is  likely  to  "factor  out"  the  contaminating 
influences  of  non-work  variables;  however,  this  still  needs  empirical  testing.  Also,  this  may  be 
a  more  serious  problem  with  the  WTPT  method.  If,  for  example,  a  person  is  having  a  problem  at 
home  and  this  has  temporarily  reduced  his  or  her  job  performance,  the  observer  will  not  know  to 
adjust  for  that  factor.  It  may  be  necessary  for  the  observer  to  collect  information  from  the 
supervisor  on  these  possible  effects  before  collecting  job  performance  data. 


4.9  Performance  Constraints 


As  the  literature  indicates,  performance  constraints  can  also  affect  the  individual 
performance of job incumbents. As with non-work variables, raters typically take these
constraints (machines, supplies, forms, etc.) into consideration when evaluating an individual.
These  factors  pose  the  most  serious  threat  to  the  WTPT  method.  Thus,  any  research  on  this  method 
should take these variables into consideration.


4.10 Organization/Unit Norms


This  variable  has  a  particularly  large  impact  on  rating  quality  when  the  ratings  are  used  for 
administrative  purposes.  Though  this  may  not  be  an  issue  if  the  measures  are  collected  for 
research  validation  only,  it  will  be  crucial  that  the  purpose  of  the  ratings  be  clearly  and 
strongly communicated to the raters. This may become troublesome for large-scale data
collection. It may mean the use of a public relations campaign and strongly worded instructions
to the raters that the data are being collected for research purposes only. This may not be so
much a research issue as a practical issue.

The  issue  of  effectively  communicating  the  purpose  also  impacts  on  the  packaging  of  the 
performance rating materials. These materials need to be packaged and presented in such a way
that the purpose and procedures are completely understandable to the raters. Frequently, error is
introduced  in  performance  appraisals  simply  because  raters  do  not  understand  the  instructions. 
There  is  little  evidence  on  this  issue  in  the  scientific  literature;  however,  it  is  believed  that 
it  can  be  a  major  source  of  error  in  the  performance  ratings. 


4.11 Public Relations/Administrative Procedures

Although there is little research in these areas, it has been noted several times in this
report  that  these  factors  play  an  important  role  when  large-scale  data  collection  is  being 
conducted  for  research  purposes.  This  should  be  an  area  of  continuing  concern  throughout  the 
developmental  phases  of  the  various  methods,  particularly  so  later  in  the  research  program. 


4.12  Rater  Training 

The literature rather consistently indicates that some type of rater training is needed to
ensure high quality ratings. This training can be quite elaborate, involving videotaping and
experiential learning, or it can be simpler, consisting essentially of instructions to avoid
certain rating errors. In the present context, it may be possible to provide orientation
training to senior enlisted personnel in the proper use of the rating form. It may be that
accurate measures can be obtained with relatively simple training. This is clearly a
research issue, and one that needs to be investigated using accuracy as the primary dependent
variable.  Obviously,  for  large-scale  data  collection,  using  on-site  trainers  with  relatively 
short  orientation  sessions  would  be  the  most  desirable  from  a  cost-effectiveness  standpoint. 

The work by Hedge (1982) indicates that if more extensive training is necessary to increase
accuracy, alternatives to traditional "psychometric error" training must be
investigated. This effort should come as a second step in the research program. For example,
the following rater treatments might be tested: (a) no training, instructions only; (b)
orientation training and instructions; (c) psychometric error training, orientation, and
instructions; (d) observation training, orientation, and instructions; and (e) decision-making
training, orientation, and instructions. Although this is a rough design, it emphasizes the point
that research must be conducted to determine which of these or other procedures can best improve
the accuracy of the ratings for validation purposes. This research can use either "paper people" or
videotapes with "true" scores in a laboratory setting for the initial work.
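
As a rough illustration of how raters might be allocated to the five treatments listed above, the following sketch performs a balanced random assignment. The rater identifiers, the group size of 20, the seed, and the function name `assign_balanced` are assumptions made for the example; the report itself does not specify an assignment procedure.

```python
import random

# The five rater treatments (a) through (e) described in the text.
CONDITIONS = [
    "a: instructions only",
    "b: orientation + instructions",
    "c: psychometric error training + orientation + instructions",
    "d: observation training + orientation + instructions",
    "e: decision-making training + orientation + instructions",
]


def assign_balanced(rater_ids, conditions, seed=None):
    """Shuffle raters, then deal them out to conditions in rotation.

    Group sizes differ by at most one, and every ordering of raters is
    equally likely, so assignment is random but balanced.
    """
    rng = random.Random(seed)
    shuffled = list(rater_ids)
    rng.shuffle(shuffled)
    return {
        rater: conditions[position % len(conditions)]
        for position, rater in enumerate(shuffled)
    }


# Hypothetical pool of 20 raters; a fixed seed keeps the example repeatable.
raters = [f"rater_{n:02d}" for n in range(1, 21)]
assignment = assign_balanced(raters, CONDITIONS, seed=1982)

group_sizes = {
    condition: sum(1 for c in assignment.values() if c == condition)
    for condition in CONDITIONS
}
```

With 20 raters and five conditions, each treatment receives four raters. Accuracy of each group's ratings against known true scores would then serve as the dependent variable.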

As discussed previously, any factors identified as contributing to error variance in the
judgmental data must be reduced through some countering technique. Training is one such
technique, as are instructions to raters, packaging of the forms, and public relations efforts.
It will require some careful scientific judgment, based on available research results, to decide
which of these techniques will best counter the error effects of these factors. Where possible,
research  studies  should  be  conducted  to  help  make  these  decisions. 

Two other important training research issues need to be investigated: duration of training
and "wash-out" effects. In terms of the length of the training session, there is little in the
literature to suggest a minimum amount of time necessary to effect substantial improvement in
accuracy. This probably depends to a great extent on the content of the training. Training
sessions as short as 5 minutes can improve the quality of ratings but are not as effective as
1-hour training sessions. At least one study has shown that accuracy can be improved with a
2-hour training session. Obviously, this issue is far from settled, and it is a critical one for
large-scale  data  collection.  The  length  of  training  is  a  direct  cost  item.  Tangentially,  in 
terms  of  cost,  research  should  also  be  conducted  to  determine  if  on-site  enlisted  personnel  can 
be  used  as  trainers,  rather  than  using  professional  trainers.  Though  this  may  mean  the  creation 
of  training  manuals  and  sessions  to  train  the  trainers,  the  cost  savings  over  using  professional 
trainers  are  enormous. 


The  literature  documenting  wash-out  effects  has  shown  that  even  with  intensive  and  lengthy 
training  sessions,  raters  return  to  making  the  same  errors  they  did  prior  to  training.  There  has 
been no research on "booster" or refresher training, and this would seem to be an important
research  issue.  This  research  could  easily  be  a  follow-on  study  to  the  training  research  study 
described  earlier.  The  experimental  subjects  would  simply  be  randomly  assigned  to  either  a 
refresher  or  no-refresher  training  condition  after  6  months.  Results  would  indicate  the 
usefulness  of  refresher  training  for  improving  rating  quality. 

Training for the WTPT observers should be based on the sources of error variance identified
in  previous  research  in  this  program.  Training  will  probably  need  to  be  intensive  to  bring 
observers  to  a  high  level  of  accuracy  in  their  observations  and  judgments. 


4.13 Intervening Variables

In the classification scheme depicted in Figure 3 are five process variables which may impact
on the quality of the measurement: (a) observation heuristics, involving the input and storage
of performance information; (b) decision heuristics, involving weighing information and making a
judgment  about  a  level  of  performance;  (c)  rater  motivation;  (d)  rater  trust  in  the  appraisal 
process;  and  (e)  acceptability  of  the  measurement  system. 

Within the performance measurement literature, research on observational processes (Hedge,
1982; Murphy, 1982), decision processes (Borman, 1977; Hedge, 1982), cognitive processes
(Feldman, 1981), rater motivation and trust (Bernardin, Orban, & Carlyle, 1981; Bernardin &
Cardy, 1982), and acceptability of the rating system (Dipboye & dePontbriand, 1981; Kavanagh &
Hedge, in press; Landy, Barnes, & Murphy, 1978) indicates these can be important determinants of
rating quality. All of this research was conducted either in a laboratory or a field setting
using operational measures of job performance. The extent to which these variables will operate
in a system used for research purposes only is unknown at this time. It would appear that
laboratory research would be a first step to determine if these variables can affect the accuracy
of the measures. It would then be critical to collect field data in follow-up research. These
variables are seen as high priority items, particularly if it is possible through training to
alter these processes and significantly improve the accuracy of the measures.

4.14  Measurement  and  Research  Paradigm  Issues 

Several important methodological issues which seemed to be quite relevant but did not fit
neatly into any of the linkages of the schema remain to be discussed.

The first issue concerns the ways in which the accuracy of the measures of job performance
are  assessed.  Though  the  other  criteria  for  evaluating  the  quality  of  the  measure  (see  Table  1) 
will  typically  be  collected  in  all  research  projects,  the  crucial  issue  revolves  around  the 
selection  of  an  accuracy  approach.  The  four  approaches  to  the  accuracy/construct  validity 
criterion  are:  (a)  the  multitrait-multimethod  analysis  (Kavanagh  et  al.,  1971);  (b)  videotapes 
of performance, with known true scores (Borman, Hough, & Dunnette, 1976); (c) "paper people,"
with  known  true  scores  (Zedeck  &  Cascio,  1982);  and  (d)  specification  of  expected  score 
distributions  on  an  a  priori  basis,  as  discussed  in  the  introduction  of  this  report.  Two  obvious 
questions  are:  (a)  Are  they  the  same;  that  is,  will  the  same  conclusions  be  reached  about  the 
accuracy  of  the  measure  regardless  of  the  accuracy  approach  used?  and  (b)  Should  research  be 
conducted  within  a  laboratory  setting  to  evaluate  each  of  these? 

A  second  issue  concerns  the  "paper  people"  and  videotape  approaches.  Can  the  materials 
created  by  other  investigators  be  used  within  the  military  context  to  assess  the  accuracy  of 
ratings? If not, then research must be started to create new videotapes and/or "paper people"
that are specific to the military. For example, it may be necessary to create videotapes of
aircraft mechanics engaged in job behaviors that vary in terms of true scores. Or, it may be
necessary  to  create  videotapes  and/or  "paper  people"  depicting  military  supervisors  (or 
incumbents  in  other  jobs). 

The  third  issue  is  closely  allied  to  the  second.  If  the  decision  is  made  to  use  videotapes 
or  "paper  people"  created  by  other  investigators,  will  it  be  necessary  to  re-establish  true  score 
profiles  using  military  rater-experts?  Do  the  true  scores  for  these  materials  generalize  across 
organizations, particularly when there are important differences between military and
non-military  settings?  Obviously,  if  materials  are  created  that  are  specific  to  the  military 
only,  true  scores  will  be  generated  by  military  rating  experts,  presumably  subject-matter  experts 
or  supervisors.  If  these  materials  are  created  for  military  jobs,  they  could  also  be  used  to 
evaluate  and  train  the  observers  in  the  WTPT  method. 

No  definitive  answers  to  these  questions/concerns  were  found  in  the  literature  reviewed. 
Some  research  will  be  necessary  to  assess  the  adequacy  of  videotape  versus  the  "paper  people" 
approach. It is the opinion of the authors that in order to serve as viable tools, these
videotapes  must  be  Air  Force  specific. 

The  last  issue  is  a  most  critical  one.  Careful  control  must  be  exercised  over  the  research 
conducted in this program. If specific approaches to the accuracy criterion are used, they need
to be used in all research studies. This would also hold true if other criteria are used to
evaluate the quality of the measures of job performance. Consistency is critical if the results
of these separate research projects are to be combined to arrive at operational decisions about
the construction of the measurement methodology for this total effort. Requiring a standard
paradigm, as opposed to having each investigation "re-invent the wheel," could also result in
significant cost savings. In sum, this argues for a research program that builds on earlier work
to arrive at the most scientific and cost-effective measurement methodology for job performance
in the military.


V.  RESEARCH  PRIORITIES 

As has been noted several times in this report, the purpose of this program of research is to
develop a measurement methodology for job performance in the military. This research focus has
guided our thinking and will shape the research priorities that are established. It has guided
our development of a classification scheme and our selection of accuracy as the principal
dependent measure. It is believed this measure is necessary if one is to choose with confidence
the best criterion, or criteria, to validate Air Force selection and classification tests.

Having underscored the purpose and focus, what follows is a proposed systematic approach to
researching the key issues in the measurement of job performance. While this section is intended
to highlight needed research, it is not intended to detail specific research projects for each
issue listed in Section IV (Research Implications). A more detailed presentation of research
issues and possible solutions is presented in Appendix A.


5.1  Measurement  Methodology 

Decisions  about  the  development  and  use  of  various  measures  of  performance,  and  the  possible 
relationships  between  these  measures,  are  of  primary  concern.  Consequently,  a  number  of 
measurement  techniques  will  be  evaluated  in  terms  of  their  ability  to  measure  job  performance 
accurately.  Because  it  is  anticipated  that  no  single  available  measure  of  job  performance  will 
accurately  assess  the  entire  criterion  space,  a  top  research  priority  is  to  identify  which 
methods  accurately  measure  which  parts  of  the  criterion  space. 

Initial efforts in this area require a priori specification of the nomological network. In
addition, part of this effort should include a more detailed look at the dimensions of
performance  used  within  methods  across  studies,  across  methods  within  studies,  and  across 
dimensions  and  studies.  The  product  of  this  research  will  be  a  multimethod-multidimension  matrix 
tied  to  measurement  of  the  criterion  space.  This  line  of  research  supports  other  segments  of  the 
project  discussed  in  Section  5.3  (Research  Paradigm  Issues). 
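
To make the idea of a multimethod-multidimension matrix more concrete, the sketch below groups the correlations among method-dimension measures into three sets: same dimension measured by different methods (convergent), different dimensions measured by the same method, and pairs sharing neither. The two methods, two dimensions, scores for five ratees, and the function name `mtmm_summary` are all hypothetical choices made for illustration; the actual methods and dimensions would come from the research described above.

```python
from itertools import combinations


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def mtmm_summary(scores):
    """Average the correlations in each region of the matrix.

    `scores` maps (method, dimension) to a list of scores, one per ratee,
    with ratees in the same order for every measure.
    """
    buckets = {"convergent": [], "same_method": [], "neither": []}
    for key_a, key_b in combinations(sorted(scores), 2):
        (method_a, dim_a), (method_b, dim_b) = key_a, key_b
        r = pearson(scores[key_a], scores[key_b])
        if dim_a == dim_b and method_a != method_b:
            buckets["convergent"].append(r)    # same dimension, different method
        elif method_a == method_b:
            buckets["same_method"].append(r)   # different dimension, same method
        else:
            buckets["neither"].append(r)       # different dimension and method
    return {name: sum(v) / len(v) for name, v in buckets.items() if v}


# Hypothetical ratings of five incumbents on a 7-point scale.
scores = {
    ("supervisor", "technical"): [5, 3, 4, 2, 6],
    ("peer", "technical"): [5, 2, 4, 3, 6],
    ("supervisor", "interpersonal"): [2, 5, 3, 4, 3],
    ("peer", "interpersonal"): [3, 5, 2, 4, 3],
}

summary = mtmm_summary(scores)
```

Under the usual multitrait-multimethod reasoning, evidence of construct validity would show up as convergent correlations that are clearly higher than those in the other two regions.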

A related research issue of equally high importance involves empirically testing this
hypothesized nomological net. Once the desired measurement methods are developed and refined,
the postulated relationships can be experimentally tested. This research should take place in
the  later  stages  of  the  research  project. 

In addition to the multimethod-multidimension criterion space research, a major measurement
methodology research effort will involve refinement of the WTPT technique. Because this method
is still in the early stages of development, and is viewed as the benchmark and high fidelity
component of the measurement process, initiation of this research is both important and urgent.
Associated priority research involves the development of the performance standards scoring key to
be used by personnel conducting the walk-through testing. This key is an integral part of the
WTPT methodology and must be developed in conjunction with the technique itself.

With WTPT, the actual combination of tasks to be rated may be unique for each job. Existing
measures of task difficulty and aptitude requirements may need to be used to equate the task
ratings so individuals' scores can be compared on the same scale. Incumbents' experience ratings
could be used in the same fashion.

Research should also focus on the development of alternate job performance measures. As
previously  noted,  the  walk-through  testing  results  will  serve  as  the  reference  point  against 
which  more  global  and  less  expensive  measures  will  be  compared  to  select  the  measure(s)  to  be 
used  operationally. 


5.2  Scale  Development  and  Characteristics 

Research on scale development and scale characteristics receives a high priority rating, not
because of the number of unresolved questions, but because of the urgency associated with scale
development. Although there are well-established guidelines (as noted in Sections III and IV), a
wide range of alternative job performance measures must be developed in order to compare
different performance measures with the walk-through testing procedure. These should include
peer, supervisory, and self performance ratings, and should range from ratings of general
performance to highly specialized, task-specific measures.

Experience ratings need to be obtained from incumbents to help moderate confounding effects.
The experience measures will be particularly important to moderate the ratings, since it will not
be possible to tailor the rating forms to each job incumbent. At best, the rating forms can be
written for job types within specialties. In any event, the specialties should be the same AFSs
as those used in the walk-through testing approach.

Research conducted during this developmental phase should focus on issues such as the number
of dimensions/items required to optimize accuracy, and the degree of content overlap across
dimensions, jobs, and specialties. Although this work is not a high priority on the importance
continuum, the need to develop rating scales and the WTPT method together in the same AFSs
elevates this research to a top priority on the urgency continuum.


5.3 Research Paradigm Issues


A  major  research  focus  must  be  how  to  operationalize  and  apply  the  accuracy  criterion  as 
various  research  issues  are  confronted.  Though  other  criteria  (e.g.,  reliability,  psychometric 
effects)  will  be  collected,  the  crucial  issue  revolves  around  the  selection  of  an  accuracy 
approach. As noted in Section IV, there are at least four main paradigms available
(multitrait-multimethod analysis, videotapes, "paper people," and expected score distributions on
an  a  priori  basis). 


Whenever possible, a combination of approaches should be used to measure accuracy. Both the
a priori specification of expected score distributions and the post hoc multitrait-multimethod
analyses should be used as frequently as possible because of their direct link to the
hypothesized nomological net, and the eventual empirical testing of the criterion space
conceptualization. However, it is believed that the best single approach to assessing the
accuracy of the measuring devices is the videotape approach.


This  method  affords  several  advantages.  First,  because  this  approach  is  based  on  the 
development  of  scripts  depicting  varying  levels  of  performance  on  different  dimensions,  the  level 
of  performance  can  be  easily  manipulated  (as  can  variables  such  as  environmental  setting,  sex/age 
of ratee, type of task being viewed, etc.). Also, normative true scores will be generated, and
thus, it will be known on an a priori basis exactly where on the scale a rater should be rating.
In addition, the level of specificity (i.e., task, job, AFS) of each tape can be varied depending
on  the  purpose  and  focus  of  measurement. 
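
One simple way to turn known true scores into an accuracy index is to average the absolute difference between a rater's scores and the true scores across dimensions. The sketch below does this for a single hypothetical videotape; the dimension names, the 7-point scale values, and the function name `mean_absolute_error` are assumptions for illustration, and other accuracy indices could equally be used.

```python
def mean_absolute_error(ratings, true_scores):
    """Average absolute deviation of ratings from true scores.

    Both arguments map dimension name to a scale value. A result of 0
    means the rater matched the true score on every dimension.
    """
    if ratings.keys() != true_scores.keys():
        raise ValueError("ratings and true scores must cover the same dimensions")
    total = sum(abs(ratings[dim] - true_scores[dim]) for dim in true_scores)
    return total / len(true_scores)


# Hypothetical normative true scores for one videotaped performance.
true_scores = {"technical": 5, "safety": 4, "teamwork": 6}

# Hypothetical ratings of that tape by two raters.
rater_a = {"technical": 5, "safety": 3, "teamwork": 6}
rater_b = {"technical": 7, "safety": 6, "teamwork": 6}

error_a = mean_absolute_error(rater_a, true_scores)
error_b = mean_absolute_error(rater_b, true_scores)
```

Here rater A deviates less from the true scores than rater B, so rater A would be judged the more accurate of the two on this tape.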


Videotapes  should  be  developed  for  a  number  of  AFSs,  using  Air  Force  personnel  or 
professional  actors  portraying  Air  Force  personnel.  Every  effort  should  be  made  to  make  the 
tapes as similar to on-the-job conditions as possible. Once the tapes are developed and validated, data
collection will be accomplished in the field using actual military raters.


The use of videotaped vignettes will also be beneficial in answering other important
performance measurement research issues. For example, in deciding on the type of training to
give observers/raters of behavior, the videotapes can provide the standardized performance
against which to judge the effectiveness of training. Also, videotapes of individuals being
evaluated in a WTPT situation would provide an excellent mechanism for giving observation
training to test administrators. Other specific issues that might be addressed include: (a) the
amount of observable behavior required before a rater/observer can make an accurate decision
about level of performance, (b) how a rater's prior knowledge of the ratee (job-related or not)
affects the accuracy of ratings, and (c) the best number of dimensions or items to be used with a
particular measurement method in order to optimize the ability of the rater/observer to rate
accurately.

Finally, the a priori specification of expected score distributions and the multitrait-multimethod
construct validity analyses should also be included whenever possible, to gain
additional  insight  into  the  accuracy  of  measurement.  Traditional  psychometric  measures  should 
also  be  included  in  any  data  collection  effort.  However,  the  videotape  approach  to  measurement 
accuracy  is  considered  a  cornerstone  of  this  entire  project  and  as  such,  is  rated  high  on  both 
urgency  and  importance. 
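
As a rough illustration of the multitrait-multimethod logic mentioned above, the sketch below averages the convergent correlations (same dimension, different methods) and the heterotrait-heteromethod correlations (different dimensions, different methods). The function names, the dictionary layout, and the dimension and method labels are hypothetical; a full analysis would inspect the whole correlation matrix rather than two averages.

```python
from itertools import combinations
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equal-length score lists with nonzero variance."""
    mean_x, mean_y = mean(x), mean(y)
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / (ss_x * ss_y) ** 0.5


def mtmm_summary(scores):
    """Summarize a multitrait-multimethod data set.

    `scores` maps (trait, method) -> list of scores, one per ratee, with
    ratees in the same order in every list. Returns
    (mean convergent r, mean heterotrait-heteromethod r).
    Assumes two or more traits and two or more methods are present.
    """
    convergent, discriminant = [], []
    for key_1, key_2 in combinations(sorted(scores), 2):
        (trait_1, method_1), (trait_2, method_2) = key_1, key_2
        r = pearson(scores[key_1], scores[key_2])
        if trait_1 == trait_2 and method_1 != method_2:
            convergent.append(r)      # same trait measured by different methods
        elif trait_1 != trait_2 and method_1 != method_2:
            discriminant.append(r)    # different traits, different methods
    return mean(convergent), mean(discriminant)


# Hypothetical scores for five ratees on two dimensions and two methods.
example = {
    ("skill", "supervisor"): [1, 2, 3, 4, 5],
    ("skill", "wtpt"): [2, 3, 4, 5, 6],
    ("effort", "supervisor"): [5, 3, 4, 1, 2],
    ("effort", "wtpt"): [4, 3, 5, 1, 2],
}
convergent_r, discriminant_r = mtmm_summary(example)
```

Evidence of construct validity in this sense requires the convergent average to exceed the heterotrait-heteromethod average by a clear margin; the data above are invented solely to exercise the code.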

5.4  Identification  of  Possible  Sources  of  Error  Variance 

This  section  represents  an  attempt  to  "pull  together"  research  questions  whose  specific  focus 
is  the  identification  of  possible  sources  of  error  variance.  Most  of  these  issues  are  rated  no 
more  than  average  on  either  the  urgency  or  importance  continuum. 

A  large  part  of  the  discussion  in  Section  IV  that  dealt  with  variables  such  as  rater 
characteristics,  rater/ratee  relationships,  social  context,  non-work  variables,  performance 
constraints,  and  intervening  variables,  focused  on  issues  related  to  error  variance.  Therefore, 
only a few issues of importance will be discussed here.

For  example,  one  issue  of  concern  involves  the  purpose  of  our  measurement  --  validation 
research.  Because  the  purpose  for  which  ratings  are  to  be  collected  may  affect  the  degree  to 
which  a  rater  is  willing  to  provide  accurate  ratings,  it  may  be  that  raters  will  perceive 
validation  research  as  having  no  negative  consequences  for  them,  and  therefore,  provide  accurate 
ratings.  Thus,  the  variable  of  concern  becomes  one  of  the  rater's  trust  in  the  uses, 
consequences,  and  benefits  of  the  data  collection  effort. 

Another  issue  of  concern  involves  the  amount  of  behavior  observed  and  how  this  affects  a 
rater's ability to rate accurately. How much behavior must be observed before one can be
confident that the ratings are relatively accurate? Is there a point of diminishing returns?
These types of questions must be answered so that the rating scales developed, raters chosen, and
the amount of time spent gathering information all contribute to the accuracy of the measurement.
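
One simple way to locate a point of diminishing returns, once rating error has been measured at several amounts of observed behavior, is sketched below. The function name, the units (minutes of observation), and the `min_gain` threshold are assumptions made for illustration; the report does not specify an analysis.

```python
def point_of_diminishing_returns(amounts, errors, min_gain):
    """Return the smallest observation amount beyond which accuracy barely improves.

    amounts  -- amounts of observed behavior (for example, minutes of tape)
    errors   -- mean rating error obtained at each amount (lower is better)
    min_gain -- smallest reduction in error still considered worthwhile

    Steps through the amounts in increasing order and returns the first
    amount for which moving to the next larger amount reduces error by
    less than min_gain. Returns None if every step is still worthwhile.
    """
    if len(amounts) != len(errors):
        raise ValueError("amounts and errors must have the same length")
    pairs = sorted(zip(amounts, errors))
    for (amount, error), (_, next_error) in zip(pairs, pairs[1:]):
        if error - next_error < min_gain:
            return amount
    return None


# Hypothetical results: error shrinks quickly at first, then levels off.
cutoff = point_of_diminishing_returns(
    amounts=[5, 10, 20, 40],
    errors=[2.0, 1.2, 1.0, 0.95],
    min_gain=0.3,
)
```

With these invented figures the function returns 10: extending observation from 5 to 10 minutes reduces error by 0.8, but extending it from 10 to 20 minutes reduces error by only 0.2, which is below the chosen threshold.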

Many  other  potential  sources  of  error  variance  must  also  be  accounted  for,  including 
rater/ratee  age,  sex,  and  race  congruence;  the  effects  of  various  performance  constraints  on  both 
raters  and  ratees;  and  the  effects  of  non-work  variables  on  job  performance.  These  represent 
only a subset of the questions to be answered.


5.5  The  Control  of  Error  Variance 

After  identifying  sources  of  error  variance,  research  must  deal  with  the  control  or 
elimination of at least some sources of error variance. Although a number of contributors to
error variance will be controlled through standardizing procedures (randomization, equating,
etc.), a major research effort should be undertaken under the general heading of rater/observer
training. For instance, a training program (with a "public relations" focus) could be developed
that  is  aimed  at  increasing  the  accuracy  of  ratings  by  increasing  the  raters'  motivation  and 
trust  in  the  appraisal  process. 

A much larger training effort should be initiated by the second or third year of this
project, directed toward training raters/observers in ways that will increase the accuracy of
their evaluations. Training to improve observational skills would seem most beneficial for WTPT,
while other types of training may be required for the supervisory, peer, and self rating
methods.  Small  laboratory  and/or  field  pilot  studies  will  be  required  to  make  decisions 
regarding  length,  type,  and  content  of  training  prior  to  implementation. 

Training  is  one  of  the  major  techniques  that  will  be  introduced  to  reduce  error  variance.  In 
addition,  instructions  to  observers/raters,  packaging  of  the  forms,  and  public  relations  efforts 
should  be  used.  In  terms  of  the  scope  of  this  project,  training  is  important,  but  relatively 
less  urgent  than  identifying  sources  of  error  variance  and  developing  the  necessary  measurement 
methods. 


5.6 Final Comments

Much of the initial research suggested here should also be repeated/refined as additional Air
Force  specialties  are  incorporated  into  the  research  effort,  particularly  in  the  areas  of  scale 
development,  WTPT  development,  and  development  of  the  videotapes.  As  information  and  knowledge 
are  gained  from  initial  work  in  these  areas,  it  is  anticipated  that  time  required  for  development 
will  be  significantly  reduced. 

In  this  program,  the  WTPT  technique  has  been  designated  as  the  benchmark,  high  fidelity 
method.  Consequently,  much  time  and  effort  will  need  to  be  focused  on  this  technique  in  order  to 
close  the  credibility  gap  between  actual  and  ultimate  criteria. 

The  choice  of  accuracy  as  our  measurement  quality  criterion  is  another  major  innovative 
approach that characterizes this program of research. This approach is not typical of past
research  efforts  in  performance  measurement,  yet  recent  research  findings  have  begun  to  raise 
questions  concerning  the  adequacy  of  the  more  traditional  criteria  of  measurement  quality. 

Finally,  the  authors  consider  it  essential  that  a  job  performance  measurement  system 
development effort of this magnitude be undertaken in a systematic manner. The worth of the
system developed may be ultimately determined by the willingness of project personnel to
systematically plan, review, and evaluate research priorities and directions during the course of
this program of research.
APPENDIX A: RESEARCH ISSUES


The specific research issues and solutions included here are categorized
according to the headings identified in Section IV - Research Implications.
Each research issue is rated on nine-point scales of importance and urgency
(1 = most important/urgent; 9 = least important/urgent). In addition, estimated
start and completion dates for each research effort are included to help
"ground" the ratings within a 5-year R&D period.
INDIVIDUAL CHARACTERISTICS


I) Research focusing on the trust respondents/ratees
have in the measurement system
A) Issues

1) Are instructions on the rating forms adequate to ensure respondent trust?    5  8  3  4

2) Is a training program necessary to ensure trust?    5  8  3  4

3) Who could/should administer such a program (e.g., trained scientists, on-site personnel)?    5  8  3  4

4) Would a public relations campaign be as effective as other means to improve trust?    5  8  3  4


B)  Solutions 

1) Some of this research can be tagged on to other
efforts (and/or be in-house generated); e.g., a
questionnaire administered during early testing
of ratings/WTPT procedures to determine the best way
to convey the message that while the effort is for
research purposes only, it is important.

2) A more extensive effort would involve an
experimental design manipulating type and degree
of training and type of person administering tests,
and then measuring trust


II)  Research  focusing  on  the  individual  differences 
between  raters/observers 
A)  Issues 

1) Can/should WTPT administrators be selected according to certain individual difference
criteria? (In other words, are certain attributes/abilities predictive of
observational accuracy?)    4  10  5  5
B)  Solutions 

1) Some information can be gained from existing
files (ASVAB scores, etc.)

2)  Additional  information  can  be  obtained  by 
comparing  accuracy  of  different  types/groups  of 
raters/observers 

III)  Research  focusing  on  rater/observer's  understanding  of 
the  job 
A)  Issues 

1) How important is the rater/observer's understanding of the job to ensuring accuracy?    4  4  2  3

2) If this is important, can the person be trained to be more knowledgeable about the job?    4  4  2  3

3) If so, how does this relate to accuracy?    4  4  2  3


B)  Solutions 

1) Best way -- use developed videotapes with
normative target scores to assess accuracy
of raters with differing amounts of knowledge
of the job (lab and/or field setting)

2) Train observers on job content, etc. (maybe
using videotapes of WTPT with differing
levels of proficiency, varying amount of
information/length of training)

a) Once again, evaluating in terms of impact on
accuracy (this feeds into Rater/Ratee II)

RATER/RATEE  RELATIONSHIPS 

I)  Research  focusing  on  degree  of  acquaintance  between 
rater  and  ratee 
A)  Issues 

1) What is the relationship between rater & ratee
acquaintance and accuracy (does the degree of
acquaintance help or hinder accuracy -- maybe
it's an inverted U relationship)?    5  5  3  5

2) Related to this question, how can
acquaintance/prior knowledge error variance
be reduced?    5  5  3  5
B) Solutions

1)  Test  experimentally  by  manipulating  degree  of 
acquaintance (determined by questionnaire) &
measuring  amount  of  error  variance  in  ratings 

2)  Use  videotapes  &  manipulate  degree  of 
acquaintance  by  amount  of  prior  information 
presented 

3) In relation to issue #2, this may be a training
question  &  can  be  tagged  on  to  other  research  on 
reduction  of  error  variance 


II) Research focusing on amount of observable behavior
required
A) Issues

1) What degree/amount of observable, relevant
behavior is required (how much is necessary,
and when no more helps in terms of accuracy)?    3  3  2  2

B) Solutions

1) Post hoc data analyses (e.g., regression
analysis) to help determine, for instance,
how many dimensions/items are required

2) Use videotapes & manipulate amount of information
raters/observers receive and see how it
affects accuracy

3) Just ask raters how long it took/how many
dimensions, etc.


III)  Research  focusing  on  rater/ratee  sources  of  error 
variance 

A)  Issues 

1) Do sex, age, race effects (and/or
interactions) exist with WTPT/rating forms?    5  5  3  5

B) Solutions

1) Manipulate/control these variables in all
possible combinations (tag-on research)


RATINGS 
MEASUREMENT METHODS


I)  Research  focusing  on  the  relationships  among 
measurement  methods 

A)  Issues 

1) A priori specification of the relationship
between measurement methods and the dimensions
of job performance within the criterion space
(what method measures what piece of the
criterion space; where is the overlap, etc.)    1  2  1  1

2) Empirical test of the hypothesized nomological
network    2  10  5  5

3) The WTPT procedure is envisioned as the
benchmark & high fidelity component of the
measurement methods. Therefore, research
concerning refinement of approach, etc. will
be undertaken    1  1  1  2

B) Solutions

1) Literature review of methods used, criterion
space measured, etc., & theoretical development
of a measurement method by dimension matrix

2) Look at uniqueness of dimensions (i.e., when
we're validating ASVAB, do we need to validate
it against some sort of weighted checklist, or a
composite, develop a synthetic criterion, etc.?)
a) possibly use a policy-capturing approach

C) Consequences

MUCH OF THIS TOTAL RESEARCH EFFORT REVOLVES AROUND
THE DETERMINATION OF THE MOST ACCURATE MEASURES OF
DIFFERENT ASPECTS OF PERFORMANCE, AND THUS, THIS
RESEARCH IS CRITICAL TO THE OVERALL SUCCESS OF THE
EFFORT


RATING SCALE DEVELOPMENT

I) Research focusing on various aspects of scale
development

A) Issues

1) What dimensions should be used (see measurement
method IA & B; scale characteristics IA & B)?    7  1  1  3

2) Who should generate critical incidents?    8  1  1  3

3) Who should provide scalar points?    8  1  1  3

B)  Solutions 

1) See scale characteristics IB & measurement methods IB

C)  Consequences 

FAILURE TO PURSUE THIS LINE OF RESEARCH WILL
REDUCE  AF'S  ABILITY  TO  DETERMINE  THE  FIDELITY  OF 
MEASUREMENT  METHODS 


SCALE  CHARACTERISTICS 


I) Research focusing on scale characteristics that may
impact on accuracy

A) Issues

1) How many dimensions are necessary to solve the
criterion deficiency problem? (This will be an
initial effort -- it can be refined/validated
later); how many dimensions/items will be
required to optimize accuracy?    7  1  1  1

B) Solutions

1) Factor analysis or similar statistical procedure
to determine number of dimensions

2) [illegible in source]

a) with videotapes -- manipulating number of
dimensions on rating form and then measuring
accuracy

b) collect field data using rating forms with
differing numbers of dimensions & compare


PERFORMANCE  STANDARDS 


I) Research focusing on performance standards
development for rating forms
A) Issues

1) In relation to the development of items/dimensions/scales,
should performance standards also
be developed?    8  9  4  5

B) Solutions

1) Relatively unimportant when used with "for
research purposes only" paradigm -- but may be
tag-on research at some point in project

II)  Research  focusing  on  the  development  of  performance 
standards  for  use  with  WTPT  procedures 

A)  Issues 

1) In the development of performance standards
to use as guidelines to rate performance
using the WTPT method, the issue is: who
will develop these standards/scoring keys and
how will they be developed?    3  3  2  2

B) Solutions

1) Job experts should develop the performance
standards/scoring keys to be used by the WTPT
administrators

a) development must ensure accurate scoring
of observations

C) Consequences

THE ACCURACY/FIDELITY OF THIS MEASUREMENT METHOD
HINGES ON THE DEVELOPMENT OF STANDARDS -- AND,
LIKEWISE, THE ABILITY TO TIE SELECTION, TRAINING,
ETC. TO HANDS-ON PERFORMANCE DEPENDS ON THE
ACCURACY/FIDELITY OF THIS MEASUREMENT METHOD

SOCIAL CONTEXT


I)  Research  focusing  on  influences  of  social  context 
A)  Issues 

1) What effects do various social context
variables have on measurement accuracy (which of
these variables contribute to error variance;
e.g., contrast errors)?    8  9  4  5

2) Do interpersonal skills impact on judgments of
technical competence for rating and WTPT methods
(criterion contamination)?    2  4  2  3

B) Solutions

1) Tag-on research -- see Rater/Ratee III B

2) Manipulate interpersonal skills of ratee and
measure influences on rater/observer

a) this can be done in field, or with videotapes
NON-WORK VARIABLES

I) Research focusing on non-work variables (family
problems, health, etc.)

A) Issues

1) Do, and if so, how do non-work variables affect
performance, and consequently, the way that
overall performance is perceived?    9  9  4  4

B) Solutions

1) See Rater/Ratee III B

2) In terms of WTPT, interviewer may need to obtain
info from ratee's supervisor concerning non-work
variables or possibly administer a questionnaire


PERFORMANCE  CONSTRAINTS 


I) Research focusing on performance constraints (e.g.,
[illegible], machines, etc.) that impact on individual
performance

A) & B) Issues & Solutions
see Non-Work Variables


7  .  9 


ORGANIZATION/UNIT  NORMS 


I) Research focusing on organization and unit norms
that impact on rating quality

A) Issues

1) How to best approach the problem of selling the
data collection "for research purposes only"
(really a matter of breaking the organizational
unit rating set)    5  8  3  4

B) Solutions

1) see Individual Differences I A & B
RATER TRAINING


I)  Research  focusing  on  training  raters/observers 
A)  Issues 

1) It seems apparent that some type of rater
training is necessary    2

a) What kind of training is necessary in order
to improve accuracy?

b) Who should do the training?

c) What should the content of training be?

d) What should the length of training be?

e) Should refresher training be used?

All of these issues should be evaluated in
terms of measurement & performance accuracy

2) Observer training -- with a) through e), same as above.

B) Solutions

1) Experimentally test out different types of
training -- content, length, etc. (manipulating
each variable)

2) Experimentally test which type of training
(psychometric error, observation,
decision-making) improves accuracy of ratings
and/or observations

a) this training research can be conducted both
in the field with real ratings/observations &
with videotaped vignettes for both WTPT
& rating approaches.


C)  Consequences 


FAILURE TO PURSUE THIS LINE OF RESEARCH WILL
SEVERELY REDUCE THE AIR FORCE'S ABILITY TO
DETERMINE THE ACCURACY/CONSTRUCT VALIDITY OF
VARIOUS CRITERION MEASURES.


INTERVENING  VARIABLES 

I) Research focusing on variables that intervene
between independent variables and accuracy of ratings
A) Issues

1) Observation/decision heuristics were discussed
in Individual Differences section I A & II A    4  6  3  3

2) Acceptability of system by ratees/raters and
motivation of raters/observers are important in
terms of system accuracy    6  8  3  4

B)  Solutions 

1)  See  A1 

2) Assess by means of questionnaire(s) the
acceptability  of  the  system 


MEASUREMENT  &  RESEARCH  PARADIGM  ISSUES 


I)  Research  focusing  on  research  paradigms  to  use 

A)  Issues 

1) Research deciding which measures should be
used to assess accuracy/construct validity
(multitrait/multimethod construct validity,
paper people, videotape vignettes, a priori
specification)    1

a) should we use more than one (we should be
at least consistent/systematic to some
extent)?

B)  Solutions 

1) Development of videotapes of ratee performance
so as to generate normative target scores &
assess accuracy

2)  In  house  a  priori  specifications  of  criterion 
space 

C)  Consequences 

FAILURE TO ADOPT A SYSTEMATIC APPROACH TO MEASURING
ACCURACY/CONSTRUCT VALIDITY WILL UNDERMINE THE
WORTH OF THE ENTIRE PROJECT.